STUDY OF AN ARTIFICIAL NEURAL NETWORK FOR THE CONTROL OF A CELLULASE PRODUCTION PROCESS


COMILLAS PONTIFICAL UNIVERSITY
ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA (ICAI)

INDUSTRIAL ENGINEER MASTER'S THESIS

STUDY OF AN ARTIFICIAL NEURAL NETWORK FOR THE CONTROL OF A CELLULASE PRODUCTION PROCESS

AUTHOR:
ADVISER: Hsu, Chau-Yun

MÁLAGA, July 2016


SUMMARY

STUDY OF AN ARTIFICIAL NEURAL NETWORK FOR THE CONTROL OF A CELLULASE PRODUCTION PROCESS

Author: . Adviser: Hsu, Chau-Yun. Collaborating institution: Tatung University.

PROJECT SUMMARY

Introduction

Problem statement

This project arises from the need to automate a controlled fermentation process. The process is partially automatic, since some simple algorithms and control loops already exist, but there are other, more complex aspects for which those methods are not enough. In particular, a human expert must continuously monitor the conditions of the experiment during the first 48 hours in order to activate the feeding mechanism. The feeding must be triggered as soon as the microorganisms run out of their initial food supply, although that moment is not easy to detect with the analog sensors that feed information to the control computer. For this reason, the operator is needed to withdraw periodic samples of the culture and measure the level of glycerol present, which is the carbon nutrient source of the microorganisms.

Scope of the project

The aim of this project is to study the feasibility of implementing an intelligent system based on artificial neural networks that replaces this operator and so makes the process more automatic. Such a system must do without the manual glycerol measurements and rely only on the analog signals coming from the sensors, thus saving the operator the tedious task and the money spent on the samples. This study seeks to determine whether it is feasible to build a system that can discern the feeding activation moment, even though the available information is incomplete and varies from experiment to experiment.

State of the art

Although artificial neural networks have been around for more than half a century, only recently have they seen strong development and wide commercial application. This success is partly due to one type of neural network that has proved to be very powerful and has found broad fields of application: the multi-layer perceptron [HUSH93]. This structure is formed by simple interconnected units, or perceptrons, which together acquire very powerful system-level properties.

For example, a two-layer network of sigmoid perceptrons is able to approximate any continuous function [CYBE89]. Each perceptron, inspired by the structure of biological neurons, has several inputs that go through a weighted sum and a non-linear activation function that generates a single output. The outputs of the perceptrons of one layer are the inputs of the next layer, and all the layers are connected in this way, forming a unidirectional flow of information from the inputs to the outputs of the system (feed-forward neural network).

f(z) = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b)   (1)

Multi-layer perceptrons are commonly used in classification or regression applications [HUSH93]. Like other neural networks, they are characterized by having abilities that are not strictly programmed. These abilities, which reside in the values of the input weights of the perceptrons, are acquired through supervised learning, in which the network is provided with a set of example inputs, each with an associated output that the network must learn to generate [LIPP87]. Every time the system goes through one of the inputs it produces an output, this output is compared with the ideal one, and the network adjusts its internal parameters to approach the desired behavior. Once these examples have been traversed a large number of times, the system will have mapped the input-output relations and will be able to generate correct outputs for cases it has never seen before. This is possible because neural networks have generalization and abstraction abilities [GOOD16].

Methodology

Available data

This study has the data recorded during nine fermentation processes carried out under different conditions. Each fermentation lasts around 300 hours, although for the study at hand only the data up to the moment when the cell feeding was activated are needed, which happens between 20 and 40 hours after the start of the process. The data of the first 10 hours are also discarded, since they correspond to the initial stabilization of the system. Each experiment contains the data of three variables coming from analog sensors: the pH, the dissolved oxygen (DO), in %, and the oxidation-reduction potential (ORP), in mV. The data from these sensors were recorded once every 5 minutes during the experiments. Among the signals that the operator uses to detect the feeding moment are a rise of the pH up to 5.60 and a noticeable rise of the DO. These two events, however, are neither necessary nor sufficient to determine the cells' need for more food.

Algorithms employed

This study uses a multi-layer perceptron with one hidden layer and one output layer. The number of inputs of the neural network is seven, including the three analog variables and four transformations of these variables obtained by applying filters that highlight their differential properties or their trend.

Specifically, the inputs are the pH, the pH with a differential filter and exponential smoothing, the DO, the DO with both filters, the ORP, the ORP with the differential filter and the ORP with both filters. The output of the network is a variable in the range [−1, +1] that represents the decision to activate the feeding mechanism (+1) or not to do so (−1), with intermediate values representing different levels of confidence in the decision. The example fermentation processes have an associated ideal output of −1 at every point, except for the last points before the operator triggered the feeding. The goal of the neural network is to classify the points correctly into one of the two values. A threshold value is chosen a posteriori, which sets the discriminating value by which the points are separated into one class or the other. The learning algorithm used is error back-propagation, with competitive learning and L2 regularization. The cost function used is the mean squared error, and the activation function of the neurons is the sigmoid function extended to the range [−1, +1].

σ(z) = 2/(1 + e^(−z)) − 1   (2)

The adjustable hyper-parameters for training are the number of neurons of the hidden layer H, the learning coefficient µ, the regularization parameter λ and the number of times all the inputs are traversed (epochs). Five of the nine fermentation processes were used to train the neural network (the training set); another two were used to adjust the hyper-parameters so as to ensure the generalization capability of the network (the validation set); another one was set aside and used at the end to evaluate the system on completely new data (the test set); and the last one was discarded, as it had taken place under conditions very different from the rest.

Results

After a series of trials, errors and adjustments of the system's parameters and algorithms, the final system is able to correctly identify the feeding point in all the experiments, both those of the training set and those used for evaluation. The largest error the system makes is 0.3, which is very positive considering that the difference in the ideal output between the two classes is 2. Figure 1 shows the value of the system output in the five training experiments. Note that all points are classified as −1 except those at the end of each experiment, which correspond to the moment when the feeding starts. The time margins of all the experiments are within acceptable limits, since they are below 5 hours, the maximum value set by the operator as acceptable. The hyper-parameters used to train the final system are shown in Table 1.

Figure 1. Output of the final system in the training experiments.

Table 1. Training hyper-parameters of the final system (inputs, H, µ, λ, epochs).

It is important to remark that the results, although positive, are not fully valid, since the number of experiments for which data are available is rather small and is not a significant sample of the whole population. It is therefore possible that the developed system has over-learned the details of the available sample and will fail when exposed to new experiments; in other words, the generalization capabilities of the system cannot be guaranteed.

One sign that points in that direction is how impeccable the classification results are even though there is a lot of random error in the inputs. As an alternative to the final system, a second one is proposed whose training differs from the previous one only in the parameter λ, now set to 10⁻⁴. This system, although unable to classify one of the experiments (the eighth)¹, does reflect variable confidence levels more in line with the random noise of the inputs (see Figure 2) and therefore shows better generalization capabilities. The classification margin of this second system is 0.8, so if the threshold is placed at the midpoint of that margin there is a distance of 0.4 to the closest point of each of the two classes, which is an acceptable margin.

Conclusions

The results suggest that it is possible to detect the feeding activation point of a fermentation process despite having only limited information. Nevertheless, these results are not conclusive, since the sample of available fermentation experiments is rather small. The intelligent system can be revised and re-evaluated once more data samples are available, so as to demonstrate its effectiveness in a larger number of previously unseen situations.

Figure 2. Alternative final system. Although it does not classify one of the experiments (the eighth) correctly, this version shows better generalization properties.

¹ The experiment in question presents very difficult classification conditions and it is not even certain that it is relevant.



ABSTRACT

STUDY OF AN ARTIFICIAL NEURAL NETWORK FOR THE CONTROL OF A CELLULASE PRODUCTION PROCESS

Introduction

Problem introduction

The study at issue was motivated by the need to automate a controlled fermentation process. This process, which is only partially automatic, has some complex phenomena that cannot be easily handled by a simple algorithm or closed loop. In particular, a human expert is needed for the continuous supervision of the experiment conditions during the first 48 hours in order to activate the feeding supply. This activation needs to happen as soon as the microorganisms run out of the initial food supply, although that moment is not easily detectable by the analog sensors that the central computer uses for monitoring the process. The human expert then needs to periodically withdraw small broth samples and manually measure from them the level of glycerol, which is the main carbon source for the cells.

Objectives of this study

The goal of the study is to evaluate the feasibility of implementing an intelligent artificial neural network that is able to replace the human expert and therefore let the process be fully automated. This AI system cannot have any manual glycerol measurement available and shall only use the information coming from the analog sensors. Skipping the manual measurements saves the human expert a great deal of tedious work and saves the cost of the operations. This study tries to determine whether or not it is feasible to design such a system, able to recognize the moment at which the feeding should be activated even when the information available is not complete and varies from experiment to experiment.

State of the art

In spite of having existed for around half a century, only recently have artificial neural networks seen strong development and wide commercial application. This success is partly owed to one type of neural network that has proved to be really powerful and can be used for a wide spectrum of applications: the multi-layer perceptron [HUSH93]. Its structure is formed by simple units, perceptrons, that are interconnected, forming a network with powerful properties. For instance, a two-layer network with a finite number of sigmoid perceptrons is able to approximate any continuous function [CYBE89]. Each perceptron, which is inspired by biological neurons, has a number of inputs that are summed with weights and passed through a non-linear activation function that generates one single output. The outputs of the neurons of one layer are the inputs of the next layer, so the information only flows in one direction, from the inputs to the output neurons (feed-forward neural network).

f(z) = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n + b)   (1)

Multi-layer perceptrons are commonly used in classification or regression problems [HUSH93]. As happens with other neural networks, their abilities are not strictly programmed but acquired through supervised learning, a method that provides the network with a sample data set that comes with the desired output the system should generate [LIPP87]. Every time the network takes an input sample point, it generates an output, compares it with the desired one and then adjusts the weights of its perceptrons, whose values determine the network behavior. Once all the training samples have been checked a certain number of times, the system will have created an input-output mapping and will be able to perform on data for which the desired output is no longer known. Performing well on new data is only possible thanks to the network's generalization and abstraction capabilities [GOOD16].

Methodology

Available data

This study has data from nine fermentation processes that took place under different conditions. Although each fermentation lasts around 300 hours, only the data up to the time the feeding was activated are needed, which happens between 20 and 40 hours after the start. The data from the very first 10 hours of each experiment are also discarded, as this is the time it takes for the growth medium to stabilize. Each experiment data set contains data coming from three analog sensors: the pH, the dissolved oxygen (DO), in %, and the oxidation-reduction potential (ORP), in mV. The data from these three sensors are registered once every five minutes. Among the cues the human expert uses for detecting when to activate the feeding are a rise in the pH up to 5.6 and a rise in the DO level. Nevertheless, neither of these events is a necessary or sufficient condition for detecting the start-feeding point.

Algorithms

This study uses a multi-layer perceptron with a single hidden layer and an output layer. The number of inputs of the network is seven, including the raw data from the three sensors and four derived variables obtained by applying filters that highlight their differential properties or trends; namely, the pH, the pH after applying a differential filter and an exponential smoothing filter, the DO, the DO with both filters, the ORP, the ORP with the differential filter and the ORP with both filters. The output of the network lies in the range [−1, +1], representing the decision to activate the feeding (+1) or not to do so (−1), with many levels of confidence in between. Each data point available from a sample experiment has a desired output of −1 associated, except for the last points of each experiment, which have +1. The goal of the network is to classify the points of the two groups correctly. A threshold value, which sets the cut-off for separating the two categories, is chosen a posteriori.
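As an illustration of the feature construction described above, the following sketch shows one way the differential and exponential-smoothing filters could be applied to the raw sensor signals. The function names, the smoothing factor, the order in which the two filters are applied and the use of NumPy are assumptions for illustration, not the implementation used in this work.

```python
import numpy as np

def differential(x):
    """First difference of a sensor series: highlights changes between
    consecutive 5-minute samples (assumed form of the 'differential filter')."""
    x = np.asarray(x, dtype=float)
    d = np.zeros_like(x)
    d[1:] = x[1:] - x[:-1]
    return d

def exp_smooth(x, alpha=0.1):
    """Exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1].
    The smoothing factor alpha is a placeholder value."""
    x = np.asarray(x, dtype=float)
    s = np.zeros_like(x)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def build_features(ph, do, orp):
    """Seven inputs: raw pH, DO and ORP plus four filtered versions,
    mirroring the feature list given in this abstract."""
    return np.column_stack([
        ph,
        exp_smooth(differential(ph)),   # pH, differential + smoothing
        do,
        exp_smooth(differential(do)),   # DO, differential + smoothing
        orp,
        differential(orp),              # ORP, differential only
        exp_smooth(differential(orp)),  # ORP, differential + smoothing
    ])
```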

The learning algorithm employed is error back-propagation with competitive learning and L2 regularization. The system uses the mean squared error as the cost function and the sigmoid function, extended to the [−1, +1] range, as the activation function for the perceptrons.

σ(z) = 2/(1 + e^(−z)) − 1   (2)

The hyper-parameters that determine the training conditions are the number of hidden neurons H, the learning coefficient µ, the regularization parameter λ and the number of times all the input data are used (epochs). Five of the nine experiments available have been used for directly training the network (the training set); another two were used for adjusting the hyper-parameters in order to ensure good generalization (the validation set); another one was set aside and only used at the end for evaluating the system performance on new data (the test set); and the last one was discarded, as it was performed under quite different conditions from the others.

Results

After some trials, errors and variations of the system, the final network is able to correctly classify the feeding point of all the experiments from the three sets. The biggest error in the output has a value of 0.3, which is quite low compared with the difference of 2 in the ideal output between the two classes. A graph with the value of the output in the five experiments of the training set is displayed in Figure 1. Notice that all points have an output of almost −1, with the exception of the last points of each experiment, which correspond to the time when the feeding should be activated. All time margins have acceptable values, since they are below 5 hours, which is the maximum margin stated by the human expert as acceptable. The hyper-parameters used for training this final system are shown in Table 1.

Figure 1. Output of the final system on the experiments of the training set.

Table 1. Hyper-parameters for the training of the final system (inputs, H, µ, λ, epochs).

It is important to remark that even though the results are positive, they are not fully valid, since the number of experiment samples available is quite small and cannot fully represent the total population. For this reason, it is possible that the network has learned the details of the available data too well and will fail when applied to new experiments; its generalization abilities cannot be properly guaranteed. Something that may point to that flaw is the fact that the output is quite clean even though there is a lot of noise at the inputs.

An alternative system is proposed whose training differs only in the regularization parameter, now set to λ = 10⁻⁴. Although this one cannot correctly classify one of the experiments (EXP8)¹, it holds an output with more realistic confidence variations, in line with the noise at the inputs (see Figure 2), and therefore better generalization properties. The classification margin for this second system is 0.8, so, if the threshold value is placed in the middle, there is still a distance of 0.4 to the closest point of each of the two classes, which is an acceptable margin.

Figure 2. Alternative final system. Even though EXP8 is not correctly classified, the output shows better generalization properties.

Conclusions

The results suggest that it is possible to detect the feeding activation point of a fermentation process with limited information. Nevertheless, these results are not fully conclusive, as the number of experiment samples available is not large enough. The intelligent system developed in this study can be revised and tested again once a bigger sample set (perhaps twice or three times its size) is available. Then it will be possible to have more confidence in its ability to perform well in new situations.

¹ The classification conditions of this experiment are quite difficult and its relevance is not even certain.

References

[BENG12] Bengio Y., Practical Recommendations for Gradient-Based Training of Deep Architectures, arXiv preprint (v2), September 2012.
[BENG09] Bengio Y., Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, Vol. 2, No. 1, 2009.
[CYBE89] Cybenko G., Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals and Systems (1989) 2:303-314.
[DEEP13] Bengio Y., Deep Learning of Representations: Looking Forward, arXiv preprint (v2), June 2013.
[GOOD16] Goodfellow I., Bengio Y., Courville A., Deep Learning, book in preparation for MIT Press, 2016.
[HUSH93] Hush D.R., Horne B.G., Progress in Supervised Neural Networks: What's New Since Lippmann?, IEEE Signal Processing Magazine, January 1993.

[LECU98] LeCun Y., Bottou L., Orr G., Müller K., Efficient BackProp, in Orr G. and Müller K. (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.
[LECU15] LeCun Y., Bengio Y., Hinton G., Deep Learning, Nature, Vol. 521, pp. 436-444, May 2015.
[LIPP87] Lippmann R.P., An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, April 1987.
[MITC97] Mitchell T., Machine Learning, McGraw-Hill, 1997.
[NIEL15] Nielsen M.A., Neural Networks and Deep Learning, Determination Press, 2015.
[SCHM15] Schmidhuber J., Deep Learning in Neural Networks: An Overview, Neural Networks, Vol. 61, pp. 85-117, January 2015.

For Taiwan and all its people, who gave me such a joyful year.

Acknowledgments

I would like to give special thanks to Prof. Hsu (許超雲), who has advised my thesis and from whom I have learned a lot. Not only did he introduce me to artificial neural networks, but also to the world of tea and Taiwanese culture. I must also thank Prof. Chen (陳志成) and Dr. Huang (黃丁晏) for their collaboration and time; I truly wish them the best in their research. Finally, I also want to thank the department staff, the head of the department Prof. Huang (黃淑絹) and my master's friends from EEE, who have always supported me. All the staff of both Tatung University (大同大學) and Comillas Pontifical University also deserve my appreciation for making all this happen.


Table of Contents

I. Dissertation
1. Introduction
   The decomposition of cellulose
      What is cellulase?
      Types of cellulase
   The cellulase production process
      Phases of the fermentation process
      Monitoring the medium
         Analog variables
         Manually measured variables
      Automatic control
      Carbon feeding procedure
2. Literature review
   Artificial intelligence and machine learning
      Machine learning and traditional programming
   Artificial Neural Networks
      Perceptron
      Multi-layer perceptron
         Gradient descent
         Back-propagation
         Complete gradient descent algorithm
         Other details of the training process
      Problems of learning with multi-layer perceptrons
      Variation on the algorithm
3. Methodology
   Objectives and requisites
   Data collection
   Algorithms
4. Description of the work
   First prototype of AI system
      Concept exploration: Choosing the features
      System definition
      Training the system
      Results from first prototype
   Second prototype of AI system
      Concept exploration: Choosing the features
      System definition
      Training the system

5. Results
   System results
      Training & ANN parameters
      Output results
   Discussion about the results
   Further investigation
6. Conclusions
References

II. Source Code
1. Program Code
   1.1. Header
   1.2. Main
   1.3. Initialize File Names
   1.4. Initialize Parameters
   1.5. Read Input
   1.6. Train System
   1.7. Initial Neural Weights
   1.8. Back Propagation
   1.9. Activation Function
   1.10. Load Weights
   1.11. Cross Test
   1.12. Save Results

III. Appendixes
A. Experiments Raw Data
   A.1. Experiment 1
   A.2. Experiment 2
   A.3. Experiment 3
   A.4. Experiment 4
   A.5. Experiment 5
   A.6. Experiment 6
   A.7. Experiment 7
   A.8. Experiment 8
   A.9. Experiment 9
B. Features for the Second Prototype
   B.1. Feature 1: Raw pH
   B.2. Feature 2: Differential pH
   B.3. Feature 3: Differential smoothed pH
   B.4. Feature 4: Raw DO
   B.5. Feature 5: Differential DO
   B.6. Feature 6: Differential smoothed DO
   B.7. Feature 7: Raw ORP
   B.8. Feature 8: Differential ORP
   B.9. Feature 9: Differential smoothed ORP

List of Figures

1. Types of Cellulase
2. Phases of a bulk fermentation process [5]
3. Fermentation Control System
4. Supervised machine learning
5. Generalization
6. Structure of a perceptron
7. Sigmoid Function
8. Multi-layer perceptron
9. First-order approximation
10. Learning coefficient values
11. Error function of a MLP, with steep slopes and flat areas
12. Generalization vs. capacity
13. First derivative of the sigmoid function
14. First prototype's inputs and desired output
15. First prototype's architecture
16. Outputs of initial system
17. New desired output with no flat parts
18. Results of the system with pH when applied to EXP
19. Results of the 10-neuron system (up) and the system with the new scaling (down)
20. Results of the system with the new scaling on an experiment it had never seen before, from the test set
21. Results of the system after applying regularization with λ =
22. Performance of the final system at EXP 1, from the test set
23. Results of the final system on the training set
24. Results of the final system on the validation set
25. Results of the final system on the test set
26. Output of the system trained with λ =
27. Raw data from Experiment 1
28. Raw data from Experiment 2
29. Raw data from Experiment 3
30. Raw data from Experiment 4
31. Raw data from Experiment 5
32. Raw data from Experiment 6
33. Raw data from Experiment 7
34. Raw data from Experiment 8
35. Raw data from Experiment 9

List of Tables

1. Comparison of information processing approaches
2. Hyper-parameters and results for the first trainings
3. PI for first trainings
4. Extended PI
5. Training with decreasing number of hidden neurons
6. Information of the seven features chosen for the second prototype
7. PI for trainings with extended EXP8
8. New scaling parameters for having inputs in the range [−1, +1]
9. PI for trainings with the new input and output scalings
10. Results of trainings with increasing network size
11. Results of trainings with increasing learning coefficient
12. Results of trainings with increasing learning coefficient
13. Parameters of last system
14. Weights of last training

Acronyms

ANN    Artificial Neural Network
AI     Artificial Intelligence
BP     Back-Propagation
C/N    Carbon-Nitrogen source
DCW    Dried Cell Weight
DO     Dissolved Oxygen level
FNN    Feedforward Neural Network
HCl    Hydrochloric Acid
MLP    Multi-Layer Perceptron
MSE    Mean Squared Error
NaOH   Sodium Hydroxide
NN     Neural Network
ORP    Oxidation-Reduction Potential
PI     Performance Index
RNN    Recurrent Neural Network
SGD    Stochastic Gradient Descent


Symbols

α   Momentum coefficient
β   Slope parameter of the sigmoid function
Δ   Difference operator
δ   Layer error
λ   Regularization parameter
µ   Learning rate
∇   Vector differential operator
∂   Partial derivative
σ   Sigmoid function


PART I
DISSERTATION


Chapter 1

Introduction

THE topic of this thesis was suggested by the Bio-Engineering Department² of Tatung University³, which found in its cellulase production experiments a good opportunity for applying knowledge of Artificial Neural Networks (ANN). Due to the multidisciplinary nature of this collaboration, there have been additional major challenges, as our knowledge of biology is at least as poor as theirs of Artificial Intelligence (AI). The experiments to which the work of this thesis is applied are highly complex and have been in progress for a few years already. Getting up to date with their evolution and reaching mutual understanding took some extra time, and there were quite a few steps in wrong directions before the first results were obtained. For this reason, special effort has been put into explaining and translating the complex biology terminology in a simple way, sufficient for laypersons. For further details and deeper explanations, the reader can consult the reference bibliography [1, 2, 3, 4, 5, 6, 7].

1.1 The decomposition of cellulose

Cellulose is the most abundant organic compound in the world: it is a polysaccharide that forms the skeletal basis of the cell walls of plants and algae. Each cellulose polymer is formed by a long chain of the monosaccharide glucose, which could be used for food, bio-fuel and chemicals if decomposed. Due to its constant renewal and near inexhaustibility, bio-fuel from cellulose would be a great alternative to fossil fuels, which have become a scarce energy resource under intensive use. Other biomass materials (starch, sugar, wheat, etc.) can also be transformed into sugars by a considerably easier hydrolysis⁴ process. However, their use for this purpose is controversial, since they can also be used as a food source and this would conflict with the efforts to fight hunger. Cellulose, on the contrary, is non-edible, and huge amounts of it can be extracted from agricultural waste and forest residues, which are practically useless and abundant.

² 生物工程系
³ 大同大學
⁴ Hydrolysis is a process in which chemical bonds are broken by enzymes and water molecules.

Nevertheless, all these benefits can only be achieved through the hydrolysis of cellulose into glucose, which is still at an experimental stage. The difficulty comes from the strong bonds in the cellulose polymers, which are also intertwined with one another forming crystalline structures. The decomposition can be carried out by thermal (combustion, pyrolysis, gasification, supercritical water) or thermo-chemical (acid, alkali) methods. However, most of these are inefficient or generate unwanted wastes. Enzymatic degradation with cellulase is the most effective means of degrading cellulose into useful components and is thus under investigation. The experiments of this study are aimed at producing the cellulase that allows this biological degradation to occur.

1.1.1 What is cellulase?⁵

Cellulase refers to a group of enzymes which catalyze the decomposition of cellulose into glucose units by a process called cellulolysis⁶. Breaking down cellulose is more difficult than breaking down other polysaccharides, since native crystalline cellulose is highly dense and complex, resulting in strong bonds. Cellulase enzymes, most prevalent in fungal and microbial sources, catalyze this process, allowing it to happen in a natural way. For instance, ruminants host certain bacteria in their digestive system that are able to produce cellulase, allowing them to digest the cellulose in plants. Termites also host cellulase-producing microorganisms, hence their ability to decompose wood.

1.1.2 Types of cellulase

There are three main types of cellulases: endocellulases, exocellulases and β-glucosidases⁷ (refer to Figure 1). While the first two help break down the higher-level structures of cellulose (splitting the big chains into smaller pieces), the latter is the one responsible for breaking them into individual monosaccharides and is the one being produced in the experiments of this study.

1.2 The cellulase production process

The experiments to which this thesis is applied consist of the fermentation of the yeast Pichia pastoris and the production of cellulase by this fungus. The growth occurs in batch culture and the duration of each batch is around 300 hours. To ensure an optimal fermentation process, the bulk growth of the microorganisms must occur in a controlled culture medium that is constantly monitored.

1.2.1 Phases of the fermentation process

The fermentation process has four phases (refer to Figure 2), which reflect the state of the microorganisms and the activity on which they are focused. Although these four phases are not sharply defined, it is important to estimate in which state the microorganisms are so that the ideal conditions of the medium are provided.

⁵ Beware of confusing it with cellulose.
⁶ Cellulolysis is a hydrolysis process applied to the breaking of cellulose bonds.
⁷ Also known as cellobiases.

Figure 1. Types of Cellulase [8]. Endocellulase breaks the crystal into single polymers, exocellulase chops the polymers into smaller chunks and cellobiase breaks them into glucose units.

The four phases are the lag phase, the exponential phase, the stationary phase and the death phase.

Lag phase. The first stage of the fermentation process. The growth of the microorganisms does not start until they adapt themselves to the environmental conditions and become fully active. With the abundance of oxygen and nutrients they will mature and eventually start dividing and reproducing, entering the next phase.

Exponential (log) phase. The microorganisms, now fully active, reproduce and grow in an exponential fashion. The reason is the asexual reproduction by cell doubling, by which each cell divides into two new ones (mitosis). The number of new cells appearing per unit of time is therefore proportional to the current population, and the growth becomes exponential. To maintain their fast growth there must be nutrients and oxygen available, which will be constantly consumed. The acidity of the medium will also increase as a consequence of the organic acids generated in the growth process, worsening the growing conditions.

Stationary phase. The net growth of the microorganisms slows down as nutrients become scarcer and wastes inhibitory. The cell growth rate equals the cell death rate and the population remains constant. As a consequence of this scarcity, the microorganisms now devote their activity to producing proteins instead of duplicating, so they should not have a large amount of nutrients available anymore; if a lot of nutrients were added, they would keep growing instead of producing the proteins.

Death phase. The microorganisms' life comes to an end and the number of cells starts to decrease as the environmental conditions become too degraded. The microorganisms stop producing the proteins and the experiment can be concluded at this point.

Figure 2. Phases of a bulk fermentation process [5].

1.2.2 Monitoring the medium

The success of the microorganism culture relies strongly on the condition of the growth medium, so constant monitoring must be deployed. To do so in the experiments at hand, nine variables are observed: three of them by analog sensors and six of them measured manually on extracts that are withdrawn from the fermenter every 6 hours.

Analog variables

The three variables measured by sensors are the pH, the dissolved oxygen level (DO) and the oxidation-reduction potential (ORP).

pH. It represents the level of acidity of the medium, which is the amount of hydrogen ions present in a solution. When it comes to cell culture, the pH is one of the key variables, since it strongly influences the survival of the microorganisms and their growth. It is measured with a pH meter on a scale that goes from 0 to 14, with 0 strongly acidic, 14 strongly basic and 7 neutral. Yeasts grow best in a neutral or slightly acidic pH environment.

DO (%). The dissolved oxygen level is the amount of oxygen dissolved in a fluid. It also has a great effect on cell growth and product formation, since yeasts either need oxygen for aerobic cellular respiration or, even when able to grow anaerobically, still have aerobic methods of energy production. It is measured as a percentage (%).

ORP (mV). The oxidation-reduction potential (or simply redox potential) measures the tendency of a solution to acquire electrons and be reduced, so a highly positive potential entails a strong attraction of electrons and vice versa. It is used for monitoring alterations in a system (metabolic changes in the yeast, in this case) and it is measured in millivolts (mV).

It is important to remark that there is a weak correlation between the ORP and the DO.

Although the three analog sensors measure the conditions of the experiment in real time, the data are registered in the system only once every 5 minutes. This relatively low frequency has been chosen for three reasons:

1. There is considerable noise in the analog signals due to factors irrelevant to the experiments (such as an unstable energy source, instrument imperfections, etc.). In order to eliminate it, the value registered every five minutes is the mean value of the sensor during that period of time.

2. The computer that monitors the readings has limited memory capacity and there is a risk of it shutting down due to overload. Every time this occurs, the growing conditions drift away from the ideal point and uncontrolled situations appear that may threaten the validity of the whole experiment. A high data registration frequency would increase the risk of this incident happening.

3. The data output of each experiment is already quite bulky. Raising the registration frequency would increase the size of the data proportionally, making it harder to manage and visualize with commercial programs.

It may seem that a significant amount of information is lost by this decision. This is not true, though, since the time constants of the microorganisms' metabolism are orders of magnitude longer than the 5-minute registration period, which is therefore short enough to capture the slow metabolic changes.
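As a minimal illustration of point 1 above, the following sketch reduces raw sensor readings to 5-minute mean values. The sampling rate and the function name are assumptions for illustration, not the actual acquisition software used in the experiments.

```python
import numpy as np

def five_minute_means(samples, samples_per_window):
    """Average consecutive raw readings into one value per 5-minute window.

    samples: 1-D sequence of raw sensor readings (pH, DO or ORP).
    samples_per_window: number of raw readings acquired in 5 minutes
    (depends on the sensor rate, assumed constant here).
    """
    n_windows = len(samples) // samples_per_window
    trimmed = np.asarray(samples[:n_windows * samples_per_window], dtype=float)
    # Reshape into (windows, samples per window) and take the mean of each
    # window, which suppresses zero-mean measurement noise.
    return trimmed.reshape(n_windows, samples_per_window).mean(axis=1)

# Example: a sensor read once per second gives 300 readings per 5-minute window.
# registered = five_minute_means(raw_readings, samples_per_window=300)
```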

Manually measured variables

The rest of the variables are all measured on a small sample that is manually extracted from the fermentation experiment. In most cases, the measurements take place every 8 hours for the first 48 hours and once a day for the rest of the time⁸. The six manually measured variables are the glycerol concentration, the ammonia concentration, the dried cell weight (DCW), the pNPG activity, the pNPG specific activity and the total protein concentration.

Glycerol concentration (g/l). It is the main source of carbon and nutrients for the cells. Nevertheless, a high concentration of this component negatively affects the optimal growth and protein secretion of the microorganisms, so values close to zero are preferable.

Ammonia concentration (g/l). Ammonia is a waste component resulting from feeding too much nitrogen to the cells. As with glycerol, a high concentration of ammonia negatively affects the microorganisms' living conditions and should be avoided.

DCW (g/l). The dried cell weight is a direct measurement of the biomass, or amount of microorganism concentration. By looking at its progression through time it is easy to estimate which fermentation phase the experiment is in.

pNPG activity (U/ml). The pNPG activity (or enzyme activity) is directly related to the production of cellulase and is therefore the goal variable. The higher its value, the more successful the experiment.

pNPG specific activity (U/mg). Similar to the pNPG activity but, instead of giving information about the total activity in the experiment, it reflects the activity per cell and is therefore a hint of the cells' efficiency at producing the cellulase.

Total proteins (mg/l). It measures the concentration of proteins in the experiment and therefore gives an idea of the microorganisms' production activity. Since cellulase is one of the proteins produced, knowing their total concentration gives a clue about the amount of cellulase achieved.

1.2.3 Automatic control

To ensure optimal conditions for the cells' growth and activity, the medium has to be controlled (refer to Figure 3); in particular, the pH and the DO must be closely monitored. To do so, a closed-loop strategy is implemented by adding actuators and control algorithms to the mentioned sensors.

pH. The pH level is measured by the pH meter and this value is transmitted to the pH controller. The controller compares the current pH level with the desired one (manually set by the user) and then injects hydrochloric acid (HCl) or sodium hydroxide (NaOH) in order to raise or lower the acidity of the medium, respectively. There is a tolerance band of 0.2, meaning that as long as the pH level is no further than 0.1 from the desired one, the pH controller will not actuate. In the experiments of this study the desired pH level is 5.5 most of the time, so only when the pH goes below 5.4 or above 5.6 will the controller inject NaOH or HCl, respectively, to bring it back to the optimal band.

DO. The DO level is measured by a DO sensor and the signal is transmitted to the central computer, which checks whether the current DO value is lower than a certain threshold and, if it is, opens the valve of the pure-oxygen tank. It is important to note that the feeding capacity of the oxygen tank is limited, so there may be times at which the DO control system is not able to raise the DO level because the oxygen consumption is higher than the feeding rate. In the experiments of this study the DO minimum threshold is 20%: any time a value below it is read, the valve is opened until a value above 20% is read again.

Figure 3. Fermentation Control System. While the pH and the DO are controlled by closed-loop control systems, the carbon/nitrogen source is controlled manually by the user.

⁸ The reason for this rather low frequency is the relatively high cost of each measurement.
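The two loops described above can be summarized in a short sketch. This is a simplified illustration of the control logic, assuming one decision per registered sample; the function and signal names are hypothetical and the dosing details are not those of the real controller.

```python
def ph_control_action(ph, setpoint=5.5, half_band=0.1):
    """Deadband pH control: act only outside [setpoint - 0.1, setpoint + 0.1]."""
    if ph < setpoint - half_band:
        return "inject NaOH"   # medium too acidic: raise the pH
    if ph > setpoint + half_band:
        return "inject HCl"    # medium too basic: lower the pH
    return "no action"

def do_control_action(do_percent, threshold=20.0):
    """On/off DO control: open the pure-oxygen valve below the 20% threshold."""
    return "open O2 valve" if do_percent < threshold else "close O2 valve"

# Example decisions for one registered sample:
# ph_control_action(5.32)  -> "inject NaOH"
# do_control_action(17.5)  -> "open O2 valve"
```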

1.2.4 Carbon feeding procedure

There is one more actuator that controls the medium condition of the experiment: a pump feeding the cells with the carbon and nitrogen source (C/N). This solution contains the glycerol that provides carbon to the microorganisms and the glutamic acid that provides the nitrogen. There is no direct algorithm allowing a closed loop to be built for controlling the feeding of the C/N source, so the C/N feeding values can only be decided by human judgment⁹.

The rate at which the C/N tank feeds its content is fixed (in g/s), and the user decides how long the pump will be open in each feeding period (the feeding time) and how long a period is (the period time), making two degrees of freedom. Typical period times in the experiments of this study are 1200 and 1600 seconds, and feeding times within a period vary from 0 to 80 seconds. While the period time can be fixed to a single value for simplicity (one degree of freedom is enough), the feeding time has to be calculated from the glycerol consumption observed after a few manual measurements of the glycerol concentration.

The C/N feeding does not start at the beginning of the experiment, since at the very start some germ oil is added to the solution, providing nutrients for the initial growth. For this reason, the decision to start the C/N feeding is also left to human judgment; it should ideally coincide with the point of exhaustion of the carbon source in the solution.

⁹ With the introduction of the human mistakes that this entails.
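As an illustration of how the feeding time could be derived from the manual glycerol measurements, the sketch below estimates the consumption rate from two samples and sizes the feeding pulse so that roughly the same amount of glycerol is replaced in each period. The pump rate, the glycerol fraction of the C/N solution and the overall sizing rule are placeholder assumptions for illustration, not values or procedures taken from the experiments.

```python
def feeding_time_seconds(glycerol_1, glycerol_2, hours_between, broth_volume_l,
                         period_s=1600.0, pump_rate_g_per_s=1.0,
                         glycerol_fraction=0.5, max_feed_s=80.0):
    """Estimate the pump-open time per period from two glycerol measurements.

    glycerol_1, glycerol_2: measured glycerol concentrations (g/l), in time order.
    hours_between: time between the two manual measurements (h).
    broth_volume_l: working volume of the fermenter (l).
    pump_rate_g_per_s, glycerol_fraction: placeholder pump and solution data.
    """
    # Glycerol consumed per second by the whole culture (g/s).
    consumption = (glycerol_1 - glycerol_2) * broth_volume_l / (hours_between * 3600.0)
    if consumption <= 0:
        return 0.0  # no net consumption observed: do not feed
    # Glycerol to replace over one period, converted to pump-open seconds.
    glycerol_per_period = consumption * period_s
    t_feed = glycerol_per_period / (pump_rate_g_per_s * glycerol_fraction)
    return min(t_feed, max_feed_s)  # keep within the 0-80 s range used in the experiments
```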

The human expert uses a few hints to guess that this point has been reached, which are the following:

There is a rise in the pH. When the microorganisms are active and consuming glycerol, they produce organic acids that constantly lower the pH of the medium. Since the pH control system does not allow a pH lower than 5.4, the pH will remain at that level while the cells are growing. When the cells stop growing due to the lack of a carbon source, the acidity stops increasing and the pH level is able to rise from 5.4 up to 5.6, so higher pH values can appear because of an exhaustion of the nutrients.

There is a rise in the DO. When the microorganisms are active and consuming glycerol, they also consume oxygen. The DO control system constantly tries to counter this consumption by replenishing with pure oxygen. When the carbon source is exhausted and the cells lower their activity, the DO injection has no countering force and a sharp increase in the DO level appears.

There is a change in the ORP. The ORP gives a hint of the metabolic condition of the cells. A big change in their metabolic condition may be reflected by an abrupt change in the ORP value.

The glycerol concentration is low. The glycerol is itself the carbon source for the microorganisms, so a low glycerol concentration is a necessary condition for starting the C/N feeding. However, while the previous three cues rely on analog variables that are constantly measured by sensors, the glycerol concentration is measured manually at most once every 8 hours, so its value is usually not available when the decision has to be made.

It is important to note that no single parameter is by itself decisive for making a start-feeding decision, since the experimental conditions change in every experiment as different media and strategies are applied. That is the reason why no single algorithm exists that works for all cases.

Chapter 2

Literature review

ARTIFICIAL Intelligence (AI) has seen a strong renaissance in the last decade and is more and more present in our everyday life: web searches, personalized recommendations, natural-language interfaces and image recognition all use state-of-the-art AI technologies. Computer vision and natural language processing have remarkably surpassed all previous performance thanks to the big achievements in deep learning, and its success has reached the broad public due to the large investments made by the Internet giants: Google, Facebook, Microsoft, Apple, IBM, Baidu, Yahoo, Twitter, etc. In other fields, artificial intelligence is also breaking new records and reaching milestones: the computer system AlphaGo has beaten the world champion at Go, considered to be one of the most complex existing board games and a human stronghold against computers, and self-driving cars are arriving on the streets.

There is no doubt that it is a field with great potential, both short and long term. However, despite its increasing relevance, it is still broadly unknown to the general public, partially because of its high mathematical complexity. For this reason, a brief introduction to AI and machine learning will be given before going deeper into Artificial Neural Networks, the tool used in this study.

2.1 Artificial intelligence and machine learning

Artificial Intelligence can be defined as "the science and engineering of making intelligent machines, especially intelligent computer programs"¹⁰. The next question arising immediately is: how to define an intelligent machine? An intelligent machine is a flexible one that is able to perceive the environment and take the most appropriate actions for attaining a goal. These are actually some of the features that define a living being: responding to stimuli and adapting to the environment. Living beings associate responses with stimuli and create a mapping function between them, f : X → Y. Intelligent machines should also be able to model the world in which they operate and create those mappings. This is the case of neural and fuzzy systems, since both can adaptively estimate continuous functions from input data, in a flexible way and without previously specifying mathematically how the outputs depend on the inputs. The goals for which these machines are designed can vary from processing natural language, classifying information or forecasting to pattern recognition, encoding or fraud detection.

¹⁰ John McCarthy, What is Artificial Intelligence?

Some intrinsic properties of intelligent systems are learning, generalization and abstraction.

Learning. To adapt to the environment, AI systems must adjust themselves in order to produce consistent responses. The discipline in charge of this task is machine learning, which has been studied for more than half a century. In its beginnings, Arthur Samuel defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed". The ability to learn in this context is defined by Mitchell (1997)¹¹ as follows: "A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with E." So there are basically three elements: the task or goal activity for the system, the experience coming from the environment and a performance measure that acts as feedback for the machine to know whether its behavior is the correct one. The design of the system must include a way for the machine to interpret the appropriateness of its performance and change its behavior in the correct way. For doing so, there are broadly three main types of learning algorithms: supervised machine learning, unsupervised machine learning and reinforcement learning.

Supervised machine learning. In supervised learning the system is externally provided with a wide database of examples showing the right behavior in certain situations. This data has usually been previously labeled by humans. It can be seen as if the machine had a teacher showing it how to behave. Learning by using this database of examples is called training. Once the machine is able to respond appropriately to the situations from the dataset, the training is complete and the machine is able to perform in conditions it has never seen before. Typical problems solved by supervised learning include regression and classification. Although most of these can be solved by numerical methods alone, more complex problems with many dimensions involved and intricate decision boundaries need machine learning for a satisfactory result.

Unsupervised machine learning. With unsupervised learning, no previous knowledge of the data is available. The goal of the program is to find patterns and features in the data in order to make it more structured. This is especially useful when bulky amounts of data are to be analyzed, as happens with Big Data. Some of the typical tasks performed by unsupervised learning systems are dimensionality reduction (principal component analysis PCA, factor analysis, etc.) or clustering into groups with similar properties (k-means, Kohonen self-organizing feature maps SOFM, etc.).

¹¹ Tom Mitchell, Machine Learning, McGraw-Hill, 1997.

Figure 4. Supervised machine learning [17]. The system compares its output with a desired output that is externally provided and adjusts its parameters in order to reduce the error.

Reinforcement learning. In reinforcement learning, the system has to take actions in a certain environment seeking to maximize its reward. The environment is completely unknown at the beginning, and to obtain the reward the system may have to go through several states. The machine performs and trains on-line, so it must enter a process of exploration, through which it gradually completes its model of the environment. Exploration allows the system to build a mapping of the probabilities of an expected reward for each action it may take in each situation. The system always faces the dilemma of whether it has already found the best policy of actions and should keep exploiting it, or whether it should continue exploring in order to find a better policy that gives more reward. Among the applications to which reinforcement learning is applied are game theory, operations research and genetic algorithms.

Generalization. The system's internal associations should work successfully not only with the data it has been trained on but also with novel sensory patterns. Generalization refers to how well the system performs on actual problems once it has been trained. It is one of the central challenges in machine learning, since it is not unusual for a system that seems to perform great to fail systematically when some small, critical variations are applied to the inputs. To ensure generalization in a supervised learning AI, a fraction of the training set must be set aside and used once the training is complete for checking its performance on novel inputs. This subset is called the test set. Although the training process seeks to minimize the error on the training data, what the system should really minimize is the test error, which is the real goal, since it proves that enough generalization has been attained. Generalization is also related to robustness, since, once trained, the system's response should not change largely with minor variations in the input. It should see through noise and distortion.

Abstraction. Abstraction entails that irrelevant features of an input vector are discarded and the relevant ones are preserved. These features are not explicitly in the input data, so a process of abstraction is needed for extracting broad categories with which to classify.
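A minimal sketch of the hold-out idea described above: part of the labeled data is kept aside, training only sees the rest, and the gap between the two errors hints at how well the model generalizes. The data, the model callables and the error measure are placeholders for illustration, not the procedure used later in this thesis.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, used here as the performance measure P."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def holdout_check(x, y, fit, predict, test_fraction=0.2, seed=0):
    """Split labeled NumPy arrays, train on one part and report both errors.

    fit(x_train, y_train) -> model and predict(model, x) -> outputs are
    placeholder callables standing in for any supervised learner.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    model = fit(x[train_idx], y[train_idx])                    # minimizes the training error
    train_err = mse(y[train_idx], predict(model, x[train_idx]))
    test_err = mse(y[test_idx], predict(model, x[test_idx]))   # the error that really matters
    return train_err, test_err  # a large gap suggests poor generalization
```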

Figure 5. Generalization [18]. On the left, a model with good generalization (it is able to see through noise). On the right, a system with bad generalization: even though it perfectly fits the training data (in blue), it does not work that well with new data (in orange) with small variations.

2.1.1 Machine learning and traditional programming

There are some key differences that draw a thick line between machine learning and traditional programming and clearly separate their methods and their applications. While traditional programming explicitly formulates a set of instructions that allows the automation of a process, in machine learning there is no such translation of the task into formal instructions. Machine learning systems are told what to do by providing data instead. They are able to understand part of the inner structure of the task and adapt their solutions to it. This has two direct implications that are characteristic of machine learning systems:

1. An identical program code can learn to perform different tasks depending on the dataset to which it has been exposed. In this fashion, the same computer program can learn to recognize cats or purple chairs. The difference between one task and the other does not rely on the written code, but on the numerical values of the variables of its structure, and these are adopted by the program itself during the self-learning process.

2. Since the instructions are not explicitly given and the system self-adjusts on its own, it is not easy to understand why complex machine learning systems perform well or not. In some sense, they can be regarded as a black box: we put things inside (input data), we can see the results (output answer) and we can change some parameters for tuning the behavior, but we do not really see nor fully understand the process happening inside. For this reason, one of the drawbacks of machine learning programs is that the debugging process can be really tough, and frustration may arise from the feeling of working blindly.

Since machine learning does not need explicit instructions, it especially suits tasks that humans cannot really express formally but can perform easily. This is the case of more intuitive tasks needing some level of abstraction, such as recognizing objects, symbols or patterns. Traditional programming has failed to solve these problems, since it is nearly impossible to address all the small variations that an object, for instance, may have. Accounting for all the small changes in the angle at which it is viewed, its size, the luminosity and the huge amount of external variations (colors, shapes, textures, etc.) would be a tedious task and would also need permanent revision. However, machine learning systems have proved to solve these tasks with remarkable success.

Other tasks at which machine learning has overtaken traditional methods are highly complex mathematical ones, such as regression or data classification with many dimensions and strong non-linearity. Problems on which numerical methods had to spend an inadmissible amount of resources or time have been reasonably solved by machine learning.

Table 1. Comparison of information processing approaches
Approach | Method | Knowledge acquisition | Implementation
System and Information Theory | Model data, noise, physical constraints | Analyze models to find the optimal algorithm | Hardware implementation of the algorithm
AI Expert System | Emulate human expert problem solving | Observe human experts | Computer program
Trainable Neural Networks | Design an architecture with adaptive elements | Train the system with examples | Computer simulation or NN hardware

2.2 Artificial Neural Networks

An artificial neural network (ANN) is a machine learning method inspired by the biological neurons in the brain. In mathematical terms, "ANN is a nonlinear directed graph with weighted edges that is able to store patterns by changing edge weights and is able to recall patterns from incomplete and unknown inputs." 12 ANNs are therefore a network of these neuron-like units (nodes) interconnected by channels (edges) through which they share information. Each neuron has multiple input signals and only one output 13. The neurons operate locally with a very simple behavior, and the higher properties of the system arise from the interactions between them. These interactions are determined by the weight values associated with every connection, and the learning of the system relies on changing the weights so as to improve its behavior.

The idea of having a distributed representation among small, simple units is the key to the power of ANNs, since it allows the network to dramatically increase its capacity. A distributed representation can be exponentially more compact than a local one such as nearest-neighbor or clustering models, since it can assimilate a set of parametric features that are not mutually exclusive and can be combined exponentially. In fact, it is believed that neurons in the brain share this distributed representation.

There are many types of neural networks, each with different associated tasks. Going through all of them, however, is not in the scope of this study. For this reason, only two of the most common ANNs will be presented: feedforward neural networks (FNN) and recurrent neural networks (RNN).

12 P. K. Simpson,
13 However, the output can be branched into a few channels, all sharing the same value and going to different neurons.

Figure 6. Structure of a perceptron. [19] Each perceptron receives an input vector, coming either from the outputs of other neurons or from the inputs of the system, that goes through a scalar multiplication with the weight vector of the processing unit; an activation function is then applied to the result. The output of the unit (also called the activation value) can branch out to reach other neurons and/or the outputs of the system.

The difference between them lies in the flow of the information: while feedforward networks have the information flowing in one direction, passing only once through each neuron from the input to the output nodes, recurrent neural networks allow feedback loops that are slightly delayed with respect to the original signal. In this sense, while FNNs are only spatial networks, RNNs also take time into account, so they are more suitable for tasks in which time is a decisive factor, such as anomaly detection. Typical FNNs are the single-layer perceptron, the multi-layer perceptron or the convolutional network. Some of the typical RNNs are the Hopfield network, the Boltzmann machine and the Long Short-Term Memory (LSTM).

Feedforward networks are the main type of neural nets employed for classification tasks, which is the goal of this study. The paradigmatic version of a feedforward network is the multi-layer perceptron (MLP), which is formed by layers of perceptrons stacked together. For this reason, the perceptron and its functioning will be explained first, and then all the details about multi-layer perceptrons will be shown, from the learning algorithms to the weak points and some variations for overcoming them.

Perceptron

The perceptron is a processing unit that performs as a linear classifier. Introduced by Frank Rosenblatt in 1958 [12], it was one of the first artificial neural networks to be created. Each perceptron receives an input vector X and performs a scalar product with its weight vector W. It then applies an activation function a(z) (which is usually non-linear) to the result z, finally obtaining the output of the neuron. There is also an independent term, called the bias b, which is added to the scalar product in order to give the neuron more degrees of freedom:

z = X · W + b    (1)
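A minimal sketch of this computation might look as follows (arbitrary example values; the binary hard-limiting activation used here is the one discussed next):

```python
import numpy as np

def perceptron(x, w, b, activation):
    # Weighted sum of the inputs plus the bias, expression (1), followed by the activation.
    z = np.dot(x, w) + b
    return activation(z)

# Binary hard-limiting activation: 1 if z > 0, else 0.
def hard_limiter(z):
    return 1.0 if z > 0 else 0.0

# Arbitrary example values for a perceptron with three inputs.
x = np.array([0.4, -1.2, 0.7])   # input vector X
w = np.array([0.8, 0.1, -0.5])   # weight vector W
b = 0.2                          # bias

print(perceptron(x, w, b, hard_limiter))
```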

In the beginning, perceptrons were mainly binary, having as an activation function a hard-limiter, which can be written as

f(z) = 1 if z > 0, 0 otherwise.    (2)

From this configuration, perceptrons can be regarded in two ways. First, as a linear classifier that divides the input space by a hyper-plane whose position is defined by the input weights and the bias. The hard-limiting activation function gives a value of +1 to the points on one side of the hyper-plane and a zero (or −1) value to the points on the other side. The training samples are labeled into two categories. During the training process, every time the output for a sample does not coincide with its label, the net changes the values of its weights and bias, placing the hyper-plane in a better position for the classification. The second way is regarding the perceptron as a binary logic unit able to perform most of the basic logic functions, such as AND or OR. However, it cannot perform some other logic functions, such as XOR.

The perceptron shows a strong limitation, since it can only solve problems that are linearly separable. For this reason, its investigation was discouraged for many years after its creation. When stacking binary perceptrons, the complex relations between the layers made it hard to direct the training in an efficient way: because the hard-limiting non-linearities are non-differentiable, it was not possible to easily predict the effect on the net performance caused by a small change in a certain weight. On the other hand, stacked linear perceptrons are equivalent in power to a single linear perceptron, so there was no use in stacking them. It was not until the use of the sigmoid function that it became effective to connect perceptrons in multiple layers. Its expression is

σ(z) = 1 / (1 + e^(−βz)).    (3)

The sigmoid function is bounded between 0 and 1, as happens with the hard-limiting function. However, the sigmoid function has a continuous, monotonic increase and is therefore differentiable, so small changes in the weights cause small changes in the output. The steepness of the step can be regulated by the gain of the sigmoid, β. As this value approaches infinity, the sigmoid resembles the hard-limiting non-linearity. Since the sigmoid can also take any value between 0 and 1, the output of the network provides finer information: it can express different degrees of confidence or probability in its answer, or it can be used for regression (after proper scaling). Because of the good properties of the sigmoid function, multi-layer perceptrons started to gain attention again from researchers.

Multi-layer perceptron

A multi-layer perceptron (MLP) is a unidirectional network formed by layers of perceptrons. There is an input layer 14, one or more hidden layers and an output layer. The name of the hidden layers comes from the fact that their outputs are not

14 The neurons in the input layer are not really perceptrons; they are rather nodes from which the input values come out. For this reason, some books do not consider the input layer as such.

Figure 7. Sigmoid Function.

directly observed and are usually hard to interpret, so they are hidden to our understanding. When a vector of inputs is connected to the network, the values flow through it layer by layer and finally reach the output nodes, where the output values are generated. There are no loops back to previous layers (it is a feed-forward network) and every perceptron in a layer is typically connected to all the perceptrons of the next layer. The expression of the sigmoidal perceptron j located in layer l of the network is

a_lj(W, X) = σ( W_lj · [a_{l−1}(W, X)] ),    (4)

where W_lj is the weight vector of neuron j in layer l and [a_{l−1}(W, X)] represents a vector with all the outputs coming from the previous layer l−1. This vector also includes a component with value 1, whose associated component in the weight vector is the bias b.

The MLP tries to find a mapping function between the input and the desired output, and it is really good at this task, as proved by the Universal Approximation Theorem [20]. It states that arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity. This implies that an MLP with sigmoid functions can not only be used for creating meshed, disconnected decision regions, but can also approximate any continuous function in R^n, which is really promising. Other non-linear activation functions were later proved to work as well, such as the hyperbolic tangent tanh or the rectified linear unit (which outputs 0 for x < 0 and behaves linearly for x > 0).

Multi-layer perceptrons are used in supervised learning tasks. During the training, we try to make the network learn a certain mapping f : X → Y. To do so, we provide a set of points in X that come with a label in Y. The network computes an output for each of these points, compares it with the corresponding label and then changes its inner structure in order to bring its output closer to the label value. For this purpose, the network builds the cost or error function, which reflects the difference between the network outputs and the desired outputs, and tries to minimize it. There are a few different configurations for the

Figure 8. Multi-layer perceptron. [21] Due to the smooth, continuous change of the sigmoid function, a small change in the weights leads to a smooth, small change in the output, and gradient descent can be used.

cost function, although the Mean Squared Error (MSE) 15 is the most popular one, as it gives preference to avoiding the bigger errors:

C(W, X) = (1/2n) Σ_i ( y(X_i) − a_o(W, X_i) )²,    (5)

where X denotes the set of all the training samples, y is the labeled desired output, W represents all the weights and biases in the network, and a_o(W, X_i) is the activation of the output neuron for the input sample X_i 16. Note that this function has as many terms as training samples and that each term includes the whole structure of the network and the weight and bias values, represented by a_o(W, X). It would not be reasonable at all to differentiate it with respect to every weight and look for the zero gradient, so another numerical method must be used for its minimization. For this purpose gradient descent is used, a mathematical optimization method for which the differentiability of the activation function is vital.

2.3 Gradient descent

Gradient descent, steepest descent or the method of steepest descent, is a first-order optimization algorithm that seeks to find a local minimum of a function. To do so, it calculates the first derivative of the function (the gradient) at a point and takes a small step in the steepest decreasing direction. This step is iteratively repeated until a local minimum (gradient equal to zero) is reached. As a first-order numerical method, it avoids having to calculate the gradient in the whole input space, which can be extremely computationally expensive. What it does instead is calculating the gradient at one point, finding the direction in which the function reduces its value fastest and taking a small step proportional

15 Other broadly used cost functions are the cross-entropy function or the softmax function.
16 This expression corresponds to an MLP with a single output neuron. For multiple output neurons another summation would have to be added.

Figure 9. First-order approximation. The function is approximated by a hyperplane whose gradient is the same as that of the function at the approximating point a. As the evaluation point x gets away from a, the approximation works worse and worse.

to the value of the gradient in that direction. It is important that the step is not too big, since the first-order approximation

∆C(W) ≈ ∇C(W) · ∆W    (6)

only works well in the vicinity of the point at which the gradient is calculated. To make sure that the step is always in the steepest downward direction and proportional to the slope of the function at that point 17, we should use ∆W = −µ∇C(W), where µ is a positive constant called the learning rate. Finally, the expression for the weight and bias change at each iteration would be

w_j → w_j′ = w_j − µ ∂C(W)/∂w_j.    (7)

There are some considerations that must be taken into account when deciding the value of the learning rate coefficient. On the one hand, it is obvious that a low value of µ results in smaller steps and a slower algorithm that needs more iterations. On the other hand, a too-big learning rate threatens the convergence of the algorithm, since it would climb higher and higher in thin, deep concave regions (see Figure 10). The optimal value for the learning rate is usually around half of the largest learning rate that causes divergence [24]. A heuristic method proposed for finding this point is to start with a large value and, if the algorithm diverges, divide it by three, run it again and repeat until it converges (Bengio, 2011).

The whole algorithm would work as follows: First, an arbitrary point in weight space is chosen. Then, the local gradient is calculated for every weight and for every training sample. Finally, the step size is calculated and the weight and

17 By making the step proportional to the gradient, the algorithm will move fast in steep parts (usually far from a local minimum) and slowly in near-flat areas (usually close to a local minimum).

48 I. DISSERTATION 2. LITERATURE REVIEW Figure 10. Learning Coefficient. [23] A too high learning rate will lead the algorithm to divergence. It has been proved that the optimal learning coefficient is equal to the inverse of the largest eigenvalue of the Hessian matrix H. From values higher than twice the optimal, the algorithm diverges. [25] bias are updated. Repeat it until either the error function or the gradient is small enough. Since some tasks use millions of samples, it would be extremely slow to calculate the gradient at every one of them. In these cases, a variation of the gradient descent that only uses a random subset of n samples at each iteration is applied, which is called the stochastic gradient descent (SGD). The training samples of the subset are not taken again until all other samples have been used, which is called an epoch. Note that the algorithm will stop at the first local minimum point that it finds, so in a complex function with multiple local minima, the local minimum at which it ends will strongly depend on the initial point of the algorithm, which is arbitrarily chosen. Moreover, points with gradient equal to zero (critical points) do no need to be local minima, they can also be saddle points 18. There are also some very flat areas at which the algorithm can barely move. For all these reasons, in practice the goal is set at finding a very low value of the cost function that is not necessarily a minimal value. There is one aspect of the gradient descent algorithm that has not been explained yet: how to calculate the gradient of the cost function with respect to every weight. Multi-layer perceptrons have many interconnections and tracking back the derivatives through the network can be a cumbersome task. Indeed, in the beginning MLPs were considered impractical since its learning process was extremely slow. It all changed when an elegant, smart way of calculating the gradients of all weights at once was included in the training process of neural nets: the back-propagation algorithm Back-propagation More complex ANN can have hundreds of neurons and thousands of weights and with every training sample the gradient of the cost function with respect to every single of those weights has to be calculated in order to apply the gradient descent algorithm. Back-propagation (backward propagation of errors) was proved to 18 And they can also be local maxima, but the algorithm cannot reach those. 29

be an extremely efficient way of carrying out this task 19. It actually gives the calculation of the gradients the same computational cost as the feed-forward propagation of the network. To achieve this, the algorithm takes advantage of the chain rule and makes use of the calculations of the gradients in the upper layers for obtaining the results in the lower layers. The name backward propagation of errors comes from the fact that it starts by calculating the gradient of the error in the output units and then goes down layer by layer, in exactly the opposite direction to the forward propagation of information.

In order to better understand the implementation of this algorithm, one should start by differentiating the cost function expression (5) to calculate the gradients of the weights of the output layer. For the sake of simplifying the notation, only the gradient for one training sample will be shown. The total gradient is obtained by averaging the gradients of all the training samples in the batch. The gradient of the cost function of one training sample with respect to the weights of the output neuron, W_o, is

∇_{W_o} C(W, X) = ( a_o − y(X) ) ∇_{W_o} a_o,    (8)

and, recalling (4), ∇_{W_lj} a_lj in a sigmoidal network is

∇_{W_lj} a_lj = σ′( W_lj · [a_{l−1}] ) [a_{l−1}].    (9)

Therefore, the equation can be fully expanded to

∇_{W_o} C(W, X) = ( a_o − y(X) ) σ′( W_o · [a_{o−1}] ) [a_{o−1}] = δ_o [a_{o−1}],    (10)

where δ_o is defined as the error of the output layer and allows the calculation of the gradients of all the weights and the bias of the output neuron. When the gradients of the neurons of the upmost hidden layer are to be computed, expression (10) can be used for going further in the chain rule by applying expression (9) again:

∇_{W_{o−1,j}} C(W, X) = δ_o w_{o,o−1,j} ∇_{W_{o−1,j}} a_{o−1,j} = δ_o w_{o,o−1,j} σ′( W_{o−1,j} · [a_{o−2}] ) [a_{o−2}] = δ_{o−1,j} [a_{o−2}],    (11)

with w_{o,o−1,j} meaning the weight that connects neurons o and (o−1)j, and so on with subsequent layers. Note that every neuron in every layer has a different layer error, δ_lj, and that, in the case of the bias, the a_{o−2} term is substituted by unity. Finally, the expression for computing any layer error depends directly on the layer errors of the upper layer:

δ_lj = ( [δ_{l+1}] · W_{l+1,lj} ) σ′( W_lj · [a_{l−1}] ).    (12)

19 Back-propagation is sometimes believed to be the whole learning algorithm for MLPs, when it actually only refers to the method for calculating the values of the gradients.
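As an illustration, a minimal single-sample sketch of expressions (8) to (12), followed by the gradient descent update of expression (7), could look as follows (the network sizes and numerical values are arbitrary and do not correspond to the configuration used in this study):

```python
import numpy as np

def sigmoid(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))

def sigmoid_prime(z, beta=1.0):
    s = sigmoid(z, beta)
    return beta * s * (1.0 - s)

rng = np.random.default_rng(1)

# Arbitrary sizes: 3 inputs, 4 hidden sigmoidal neurons, 1 output neuron.
n_in, n_hid = 3, 4
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.uniform(-0.5, 0.5, (1, n_hid));    b2 = np.zeros(1)

x = np.array([0.1, 0.7, -0.3])   # one training sample (arbitrary values)
y = np.array([1.0])              # its desired output
mu = 0.5                         # learning rate

# Feed-forward pass: outputs of every neuron, layer by layer.
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Output-layer error, expression (10): delta_o = (a_o - y) * sigma'(z_o).
delta_o = (a2 - y) * sigmoid_prime(z2)

# Hidden-layer errors, expression (12): propagate delta_o back through W2.
delta_h = (W2.T @ delta_o) * sigmoid_prime(z1)

# Gradients of the cost with respect to every weight and bias, expressions (10) and (11).
gW2 = np.outer(delta_o, a1); gb2 = delta_o
gW1 = np.outer(delta_h, x);  gb1 = delta_h

# Gradient descent update, expression (7).
W2 -= mu * gW2; b2 -= mu * gb2
W1 -= mu * gW1; b1 -= mu * gb1

cost = 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) - y) ** 2)
print("cost after one update:", cost)
```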

50 I. DISSERTATION 2. LITERATURE REVIEW So, first, the error of the output layer is calculated by expression (10) and then its value flows back up layer by layer. The step from one layer to the next is achieved by applying the inverse of the activation function and therefore its computation is similar to the feed-forward, with which the activation function is the computational step for going through the next layer Complete gradient descent algorithm When incorporating the back-propagation to the gradient descent, the following algorithm is applied: 1. Take a training example (randomly chosen if SGD is used). 2. Feed-forward it and calculate all the outputs of the neurons a lj. 3. Compute gradient of weights through back-propagation of errors. 4. Back to 1 until all training samples are used or the batch is complete (SGD). 5. Average gradients. 6. Update weights and bias with (15). 7. Back to 1 for next batch. The algorithm just shown does not specify when to be terminated. There are a few options for making this decision. There can be a previously fixed number of epochs or a threshold for the error or the gradient that stops the algorithm when it is trespassed. Another common technique, called early stopping, calculates the error at the validation set at the end of each iteration and stops the algorithm if it does not improve after a few batches, making it a simple method for avoiding over-fitting. There are many variations of the algorithm that overcome some of the weak points of MLPs. Some of them will be introduced in section Other details of the training process Hyper-parameters Hyper-parameters are parameters that affect the training of the network but cannot be included in the gradient descent method, since that would systematically lead to an over-fitting system. Examples of hyper-paramateres are the size of network (number of hidden nodes), the learning rate, the number of epochs, etc. These parameters are thus closely related to the generalization of the network and its capacity. In order to optimize the hyper-parameters, a sub-set of data is extracted from the training set, which is called the validation set. The system does not use this data for training, so it can be employed for observing the impact that different values of the hyper-parameters have on the generalization of the network. However, the error at the validation set does not give a full idea of the generalization capabilities of the network, since some decisions of its models have been based on the information from this data. For this purpose, there has to be a third sub-set of data samples, the test set, that is not used at all in any part of 31

51 I. DISSERTATION 2. LITERATURE REVIEW the training procedure. Since the best picture of the generalization capabilities come from the error of the network at evaluating this data, it is also called the generalization error. There are some optimization algorithms for achieving the best hyper-parameters. However, other simpler methods are used more frequently, as they are easier to implement. The simplest ones are manual search, which consists in just trying different values for the hyper-parameters and keep changing them in the direction that they seem to improve the results. This method does not take into account cross-influences between the parameters, a flaw that is covered by grid search. This latter algorithm includes nested loops of systematical variations of a parameter, exhausting in this way all the possible combinations. The main flaw of this other method is the high computational charge that it implies. Another variation consists of taking every time random values for each hyperparameter, consequently called random search. Although it may appear to be too uncontrolled, it has proved to be more efficient than grid search in most occasions [27]. 2.4 Problems of learning with multi-layer perceptrons Complexity of learning The problem underlying the training of a MLP is a NP-complete one, which means that there is no efficient way known for addressing them. Their solution cannot be found on the first place, heuristic and approximation methods have to be used instead. Moreover, its complexity very quickly increases with the size of the problem. This has always been one of the main limitations in the application of ANN. Advanced heuristics and the exponential growth in computer power are leading to clusters of computers that can be used for tackling down bigger and bigger problems, but the complexity of the problem itself remains as an anchor to its development. When it comes to back-propagation algorithm, an extra factor comes into play that slows down the learning: the error surface of a sigmoid MLP is quite harsh, as it contains some steep parts and many almost flat areas that can be confused with local minima. Increasing the learning rate for going faster in the flat parts may be dangerous for convergence, though. Over-fitting As it has been mentioned before, generalization is one important capability that is vital for the success of an ANN. A bad generalization leads to over-fitting, which is memorizing too much the details of the training set and finding relations that are not really there, but just are accidental coincidences coming from the randomness of the samples set generation. A good examples of an over-fitting system is displayed at Figure 5. One of the reasons for having an over-fitting system is its having a bigger capacity that the one needed for the task at issue. For these cases, a network shrinking would be necessary, pruning out some of the neurons until the performance starts to worsen. A graphic showing the relation between capacity and generalization 32

52 I. DISSERTATION 2. LITERATURE REVIEW Figure 11. Error function of a MLP, with steep slopes and flat areas. Figure 12. Generalization vs. Capacity. [13] Although the train error always improves with a more capable system, that does not mean that it will perform better, since the reduction of train error may come from creating complex relations in the training samples that are not really there but are just the result of coincidence. can be appreciated at Figure 12. Another method for preventing from over-fitting that has also been mentioned before is early stopping, by which the criteria for finishing the learning algorithm is seeing no improvement in the generalization error from the validation set, instead of looking at the training set. It prevents the algorithm from reaching some too-low error points that correspond to an over-fitting system. Exploding and vanishing gradients Exploding and vanishing gradients are the result of applying gradient descent to a network with activation functions like the sigmoid function. Recalling expression (12) from the back-propagation algorithm, one can see that the layer error of a neuron in any hidden layer is directly proportional to the derivate of the sigmoid 33

53 I. DISSERTATION 2. LITERATURE REVIEW Figure 13. First derivative of the sigmoid function. function, σ. However, since the sigmoid asymptotically approaches to both 0 and 1, the derivative is close to 0 at all points that get far from the center of the function (see Figure 13), leading to a really low error layer and, as a result, to a really slow learning. In this situations it is said that the neuron is saturated. The risk of saturation is amplified at the lower layers of deep networks (with many layers stacked), making the learning of their neurons even more difficult. For avoiding the saturations in the neurons, it is important to initialize the weights in a range which is both not too big and not too small. Local minima Since the gradient descent algorithm only searches the decreasing error direction locally, there are high chances that it gets stuck in local minima points. Although this may make the algorithm to stop at solutions that are not good enough, it also avoids taking the system to a global minimum that may correspond to an overfitting behavior. Actually, recent findings have proved that local minima with poor quality are a rare thing for big systems and most stopping points have similar quality [13]. However, this is not the case for smaller networks. The consequences of local minima in a small ANN is that the final result of the training strongly depends on the initial values of the weights. The weight initialization is usually set randomly for a few instances of the training, so there are more chances to arrive to a minimum point with good enough results. 2.5 Variation on the algorithm There are a lot of variations in the algorithms employed for the training of MLPs. Since finding the right weights is a NP-problem, many solving methods are just based in heuristics and different variations work well for different kinds of tasks and data structures. Most of the variations are partial solutions to the problems exposed at the previous section. 34
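As a small numerical illustration of the saturation effect described above (the dimensions and magnitudes are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for z = 0 and collapses as |z| grows (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")

# With larger weights the pre-activation z = w . x grows, sigma'(z) collapses, and
# the layer errors (which all carry a sigma' factor) become tiny: the neuron saturates.
x = np.ones(50)                     # arbitrary 50-dimensional input
for scale in (0.05, 1.0):           # two illustrative weight magnitudes
    w = np.full(50, scale)
    z = float(w @ x)                # 2.5 and 50.0, respectively
    print(f"weight magnitude {scale}: z = {z:5.1f}, sigma'(z) = {sigmoid_prime(z):.2e}")
```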

Overcoming flat surfaces

As has been mentioned before, MLP error surfaces are harsh and dominated by big, quite flat areas in which the algorithm advances really slowly. Raising the learning coefficient is not a good solution for speeding up the gradient descent learning process, since it also risks the convergence to lower points. A partial solution to this problem is adding a momentum term, which has some memory of the previous weight changes, to the updating of the weights and biases. By doing this, the speed when entering a flat surface is kept for some steps and there is no risk of diverging, as this technique partially cancels movements that change to the opposite direction, so it also allows higher learning coefficients without their risks. Momentum is implemented in the weight update part of the gradient descent algorithm. It keeps part of the previous weight change thanks to an extra term added at the end,

w_j → w_j′ = w_j − µ ∂C(W)/∂w_j + α( w_j − w_j^{t−1} ).    (13)

The parameter α, whose value is limited to the range (0, 1), controls the importance the algorithm gives to previous changes; it indicates the inertia of the algorithm.

Other options include having a different learning rate for each layer or even for each neuron or each weight. This allows some flexibility in the updating, giving slower weights a faster training than the others. However, testing learning coefficient values for every single weight of the network would be extremely time-inefficient. A heuristic rule of thumb is giving each neuron a value that is inversely proportional to the average input values to that neuron [12]. Then, there would be one hyper-parameter and all the other learning coefficients would be proportionally altered.

Improving generalization: Regularization

Complexity regularization is a series of techniques whose goal is improving the generalization of the system, usually at the expense of having a simpler network with a higher training error. Amongst the regularization forms, L2 regularization is the most commonly used one. Its main goal is eliminating unnecessary weights that are not especially relevant to the system, and hence it is often called weight decay. For penalizing the weights that do not really contribute, L2 regularization adds an extra term to the cost function from expression (5),

C(W, X) = (1/2n) Σ_i ( y(X_i) − a_o(W, X_i) )² + (λ/2) Σ_j w_j²,    (14)

where λ is the regularization parameter, controlling the importance that is given to having a smaller system instead of a lower training error, and the sum over w_j runs over all the

weights in the network but not the biases 20. As can be seen in expression (14), big weights affect the cost function in a quadratic form, so high weight values are penalized and only allowed if the training error is considerably reduced by them. In fact, a high λ value can cancel a big amount of the weights of the network by taking their values to zero. The magnitude of a weight in this system can be seen as directly proportional to the influence it has on the solution. L2 regularization introduces the following alteration to the final step of weight updating,

w_j → w_j′ = w_j − µ ( ∂C(W)/∂w_j + λ w_j ) = (1 − µλ) w_j − µ ∂C(W)/∂w_j,    (15)

where ∂C(W)/∂w_j only addresses the MSE term of the cost function, without including the regularization term. The only difference between this expression and (7) is the (1 − µλ) term multiplying the previous weight, which constantly pulls the weight value down, making it decay if the gradient of the cost function does not compensate it.

Another method that improves generalization consists in randomly dropping some of the hidden units during training; a method that is called dropout. However, it will not be discussed in this paper as it goes beyond its scope.

Competitive Learning

The first-order approximation introduces some flaws in the algorithm. The partial derivative accounts for the change in a function with respect to one variable, assuming all other variables remain the same. However, when the gradient descent algorithm ends an epoch, it updates all weights at the same time. This introduces some incorrectness in the basic assumption and affects the efficiency of the training. A method that helps address this problem is competitive learning. It eliminates some of the layer errors, namely those whose value is below the layer average. By doing so, the algorithm only preserves the layer errors that are most influential, partly reducing the assumption flaw.

Unsupervised pre-training

Another option that has proved to work really well for tackling both over-fitting and exploding and vanishing gradients in deep MLPs is pre-tuning the system with unsupervised learning. This can be done by building the lower layers with either stacked auto-encoders or stacked Boltzmann machines. For training these networks, unsupervised learning is used first in a greedy layer-by-layer training and then the MLP is fine-tuned by back-propagation. These techniques are more suitable for deep networks, so they are not discussed further, as the ANN for this study is shallow.

20 The reason is that the biases can fit accurately with less data, since they only control one variable instead of two.
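A compact sketch of the modified weight update, combining the momentum term of expression (13) with the weight-decay form of expression (15), is shown below (the gradient values are placeholders; in the real algorithm they would come from back-propagation):

```python
import numpy as np

def update_weights(w, grad, prev_step, mu=0.1, lam=0.01, alpha=0.5):
    # grad is dC/dw of the MSE term only; the L2 term is added here as lam * w.
    # The step reproduces expression (15), i.e. (1 - mu*lam) * w - mu * grad,
    # plus the momentum term alpha * (previous weight change) of expression (13).
    step = -mu * (grad + lam * w) + alpha * prev_step
    return w + step, step

# Arbitrary example values.
w = np.array([0.8, -0.3, 0.05])
grad = np.array([0.2, -0.1, 0.0])
prev = np.zeros_like(w)

for _ in range(3):
    w, prev = update_weights(w, grad, prev)
    print(w)
```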

56 Chapter 3 Methodology T HIS chapter introduces the methodology followed for carrying out the tests. It includes the objectives of the study, relevant information about the data being used and details of the configuration of the ANN and the training methods. The chapter serves as a starting point for understanding the development of the study, which is thoroughly described at the next chapter. 3.1 Objectives and requisites The goal of this thesis is studying whether an artificial neural network can be implemented for helping in the monitoring of a cellulase production process. Specifically, it will help deciding when to turn on the C/N feeding. This decision, currently done by a human expert, must be properly learned by the intelligent computer program. Its performance must coincide with the human one, with a margin of error allowed of a maximum of five hours. Since the aim of incorporating an intelligent controlling system is the automation of the production process, the AI program should dispense with the manually measured variables and base its decisions purely in the analog ones. 3.2 Data collection The data sets used for this study have been provided from some experiments carried out in the previously specified conditions. These are the most successful experiments obtained in Tatung s Biological Engineering Department and therefore the ones whose behavior should be learned by the artificial system. There is data from nine complete cellulase production experiments. Each data set contains all the values from the nine measured variables the three analog variables and the six that are manually measured. Since only the data before the start feeding point is necessary for the goal of the study which usually happens from the 20 th to the 40 th hour, the rest of the data is discarded 21. The data from all the experiments is presented at the end of this paper, in Appendix A, at page The data right after the C/N feeding starting cannot be used for testing the AI system performance since it is already influenced by the new experiment conditions. 37

Each experiment presents some small variations in the growth conditions and the controlling strategies. These variations are positive in the sense that they can improve the robustness of the AI system. Some relevant information about certain experiments is explained next.

Experiment 1 (EXP1). This experiment presents two gaps in the data (at 16.00h and 16.92h, respectively) corresponding to moments at which the computer shut down. Just before the incident, the ph and DO values went unusually low, since the control system was already having problems.

Experiment 3 (EXP3). Due to differences in the medium, the acidity does not change that sharply in this experiment and therefore there was no noticeable raise in the ph at the start feeding point.

Experiment 6 (EXP6). This experiment also contains a few gaps in the data, although they are small (just two samples at each gap) and the system behavior was not altered much around them. All the gaps are smoothly covered in order to make them readable for the AI system. However, a bigger problem in the ph controlling system of EXP6 appeared around the 22nd hour, leading to abnormal ph values as low as 5.1. Normal ph values are recovered around the 24th hour.

Experiment 7 (EXP7). In this experiment there was no pure oxygen feeding at all, so the DO value remains close to 0 the whole time.

Experiment 8 (EXP8). As happens with EXP3, the acidity of this experiment behaves differently from the others. This happens because the growth medium in this experiment contains malt oil.

In order to extract the most relevant features of the raw data, some filters have been applied to it; namely, a differential filter and an exponential smoothing filter.

Differential filter. This filter performs a simple subtraction between data points of the same input variable. The gap (g) of the filter represents the number of points in between the two that are being subtracted. The first g points remain unaltered.

x′(t) = x(t) − x(t − g)    (16)

Exponential smoothing filter. Exponential smoothing acts as a low-pass filter, reducing the high-frequency noise in a signal and highlighting the trend. It does so by averaging each data point with all the previous data in a way that gives exponentially more weight to the most recent ones. The weight for the point being evaluated is α and the weight for a point g steps back is α(1 − α)^g.

x̄_t = α x_t + (1 − α) x̄_{t−1} = α x_t + (1 − α)( α x_{t−1} + (1 − α) x̄_{t−2} ) = … = Σ_n α(1 − α)^n x_{t−n}    (17)
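A direct implementation of these two filters could look as follows (the ORP-like signal below is made up for illustration; only the filter definitions follow expressions (16) and (17)):

```python
import numpy as np

def differential_filter(x, g=1):
    # Expression (16): difference between points g samples apart; the first g points stay unaltered.
    x = np.asarray(x, dtype=float)
    out = x.copy()
    out[g:] = x[g:] - x[:-g]
    return out

def exponential_smoothing(x, alpha=0.05):
    # Expression (17): recursive low-pass filter giving exponentially more weight to recent samples.
    out = np.empty(len(x))
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out

# Made-up ORP-like signal (mV), one sample every 5 minutes, with some noise.
rng = np.random.default_rng(2)
orp = 120.0 + np.linspace(0.0, 15.0, 48) + rng.normal(0.0, 2.0, 48)

smoothed = exponential_smoothing(orp, alpha=0.05)
d_orp = differential_filter(smoothed, g=1)
print(d_orp[:5])
```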

3.3 Algorithms

The ANN that is studied for the task is a feed-forward multi-layer perceptron trained with back-propagation. A shallow two-layer network is used, with one layer of hidden neurons and an output layer with one output neuron. The inputs of the system are the three analog variables and some variations of them to which filters (namely, exponential smoothing and differential filters) are applied. Every time point 22 in an experiment has three real values associated with it, corresponding to each of the three analog variables, and four extra values from the derived ones. Each of these time points constitutes the input (X) of a training sample.

The output of the system is modeled as a variable in the range [0, 1], with 0 meaning complete confidence that the C/N feeding should not start and 1 complete confidence that it should start. Every training sample has a manually labeled output value associated with it. Only the training samples close to the last points, where the C/N feeding started, should have a high output. All the other training samples should be labeled with outputs close to zero. A threshold indicating the level of confidence needed for triggering the C/N feeding will be set a posteriori, choosing the value with the biggest possible margin from both the lowest confidence peak that appropriately triggered the feeding and the highest false confidence peak that should not trigger it.

The particularities of the MLP configuration are listed below. Some of them were not included from the beginning and were added after earlier prototypes were not successful enough. The details of the successive modifications of the network are explained in section 4.

The scaling of the inputs is linear and was initially designed to place the most relevant values in the range [0, 1]. Later on, this range was changed to [−1, 1] and the output was scaled to that range as well. There are also saturations in the input variables, limiting some values of the inputs that fall out of the mentioned range.

The activation function used for the perceptrons is the sigmoid function. However, it is slightly modified from the original one in order to allow it to have negative outputs down to −1:

σ(z) = 2 / (1 + e^(−βz)) − 1    (18)

Batch training is used, which means that the weights and biases are updated every time the gradient of all the training samples has been calculated and averaged. For this reason, picking the samples randomly is not necessary, so they are taken in their original, chronological sequence.

The cost function used is the mean squared error, MSE (Equation 5), and includes L2 regularization not scaled by the training size.

The training algorithm uses competitive learning, which means that the layer error from all neurons is not taken into account every time, but only

22 Every two consecutive analog measures are separated by 5 minutes.

the upper half with the highest values. This helps avoid the ill-conditioning problem by making the updating more sparse. For more details on this, refer to section 2.5 in page 36.

The hyper-parameters of the network are manually tuned through trial and error. The full list of the hyper-parameters follows next.

Hidden neuron number (H). This positive, integer parameter is the number of neurons in the hidden layer. It affects the capacity of the system, the learning speed, the generalization ability and the power of the system. A low value of this parameter leads to a system that may not be capable of grasping the full complexity of the problem and therefore is not able to accomplish the task (underfitting). A high value of the parameter considerably slows down the training process and may lead to a too-capable system, i.e., a system that retains too many of the details of the training set and therefore performs worse on data outside the training set (overfitting). See figure 5 at page 22 for an example of an overfitting system.

Learning coefficient (µ). It is a positive, real parameter that sets the size of the step in the gradient descent. A too-high value of µ can make the algorithm diverge. On the other hand, too-low values lead to too-slow training processes. For more details on the learning rate, refer to section 2.3, page 28.

Regularization parameter (λ). This positive, real parameter is the factor for the regularization term that helps improve generalization. A too-low value of λ makes regularization inappreciable and generalization may worsen. On the other hand, a too-high value will give too much importance to having weights close to zero and the system may under-perform. For more details on regularization, refer to section 2.5 on page 35.

Slope of the sigmoid function (β). This positive, real parameter determines how smooth or sharp the sigmoid function is, as can be appreciated in Equation 18. A low value of β leads to a smooth function with a wider slope. This shape helps reduce the flat areas in the weight space in which the gradient descent algorithm may get stuck. On the other hand, a steep sigmoid function (high value of β) leads to a faster training process.

Initial weight range (r). The departing point of the gradient descent method has to be chosen arbitrarily. This point is randomly picked from inside the delimited range [−r, r]. For weight values that are too high or too low, the system may have a really difficult beginning, due to the effects of exploding and vanishing gradients 23. An optimal initial weight range is given by the expression r = √( 6 / (in + out) ), where in and out stand for the number of input and output connections of a neuron, respectively.

23 For further details on exploding and vanishing gradients, check section 2.4 on page

60 I. DISSERTATION 3. METHODOLOGY Maximum training times. This integer parameter determines the number of epochs 24 after which the training stops. It should be big enough to allow the system to reach a low value but not as big as to let the system to overfit. Stopping the algorithm when the generalization error starts to increase is called early stopping, although it is not used in the current algorithm. 24 An epoch is a training cycle after which the weights and bias are updated. In the MLP being studied this happens every time the back-propagation algorithm goes through all the training samples (batch training), but in stochastic gradient descent (SGD) it happens with smaller mini-batches. 41
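The initialization range and the stopping options just described can be sketched as follows (the validation-error values are simulated placeholders, the early-stopping check is shown only as an illustration since it is not used in the current algorithm, and the initial range assumes the expression r = √(6/(in + out)) given above):

```python
import numpy as np

rng = np.random.default_rng(3)

def init_weights(n_in, n_out):
    # Initial weights drawn uniformly from [-r, r], with r depending on fan-in and fan-out.
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_out, n_in))

W_hidden = init_weights(7, 10)       # e.g. 7 inputs and 10 hidden neurons (arbitrary sizes)
print("initial weight range:", W_hidden.min().round(3), W_hidden.max().round(3))

# Training-loop skeleton: stop after max_epochs, or earlier if the validation error
# has not improved for `patience` consecutive epochs (early stopping).
max_epochs, patience = 200, 10
best_val, since_best = np.inf, 0

for epoch in range(max_epochs):
    # ... one batch-training epoch (feed-forward, back-propagation, weight update) would go here ...
    val_error = 1.0 / (epoch + 1) + 0.002 * epoch   # simulated validation-error curve
    if val_error < best_val:
        best_val, since_best = val_error, 0
    else:
        since_best += 1
    if since_best >= patience:
        print(f"early stop at epoch {epoch}, best validation error {best_val:.4f}")
        break
else:
    print(f"stopped after {max_epochs} epochs, best validation error {best_val:.4f}")
```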


62 Chapter 4 Description of the work I N this chapter, all the details about the development of the AI system are thoroughly explained. For this purpose, all key decisions made during the process, the reasons backing them and visual results are included. There are also multiple references to the graphics that display the raw data of the experiments, located at Appendix A, page 89. There have been two different prototype systems, the second one being created after realizing that the approach taken for the designing the first one was not the most appropriate for the task at issue. 4.1 First prototype of AI system When the first AI prototype was designed, not all the data sets were available yet: only from EXP1 to EXP5. Experiment 1 was put aside as a test set in order to evaluate the system on data it has never seen before. As a test set, it should not be used either for making any decision about the model or the hyper-parameters. Experiment 2 is randomly taken as the validation set 25 and experiments 3 to 5 constitute the training set Concept exploration: Choosing the features The first that had to be done was choosing the features that will be taken as inputs for the AI system. There were the three analog variables (namely ph, DO and ORP), to which the filters mentioned in section 3.2 could be applied. After visual inspection of the three variables in the experiments from EXP2 to EXP5, it was appreciated that: 1. The DO value was mainly important in terms of absolute value. Despite of having experiments 4 and 5 with a clear raise in the DO value, in EXP2 the DO just stayed high before the feeding step up, with no remarkable raise. What is shared by all experiments is having a DO value above 20%. Since maximum DO are close to 50%, this variable is scaled by a factor of 50, limiting the inputs to a range of [0, 1]. 2. There is a small increase in the ORP in most of the inspected experiments, although at different absolute values. For instance, while EXP2) shows a small increase from 90mV to 140mV, in EXP4, the step goes from 190mV 25 The set that allows adjusting the hyper-parameters in a way that does not lead to over-fitting. 43

63 I. DISSERTATION 4. DESCRIPTION OF THE WORK Figure 14. First prototype s inputs and desired output to 250mV. EXP3 is an exception; not due to a lack of an increase, but due to having one that happens in a more long-term scope. For these reasons, the ORP is read as a differential value and not absolutely, with high differential values being supportive of a step-up decision. First, an exponential smoothing filter with α = 0.05 is applied to the variable in order to smooth the noise. Then, a differential with a gap of g = 1 is applied on top. The range of this new variable goes approximately from 3mV to 5mV, with the false trigger reaching up to 7mV. The variable is scaled with a factor of 5, so almost all inputs are inside the range [ 1, +1]. 3. The ph value shows a raise to a value close to 5.60 in all experiments except from EXP3. Nevertheless, it was not included in this first prototype since it was preferred to start with a simpler system that could be slightly updgraded. Another important decision that was taken in the first inspection of the data for the first prototype was skipping the first 10 hours of the experiments for the training of the AI system. This decision was made because at the beginning of the experiments the DO values are usually really high and a raise in the ORP also exists. After the first 10 hours, however, all the DO values are below 50% and the initial raises in ORP are cut out. Cutting at this value is safe as the earliest feeding step up occurs at the 25 th hour, so a big security gap of 15 hours exists in between both points. Once the decisions about the training features of the inputs are made, the desired output is to be crafted. For this prototype, the points that are close to the end of the experiments that comply with the chosen favorable conditions for stepping up namely, a DO value higher than 20% and a differential ORP value around 2mV are given a high output value and the rest of the points are given an output value of 0. The inputs and the desired output of the system are displayed at Figure System definition The network used for the training is a one-hidden-layer MLP with MSE as the cost function and a sigmoid with β = 1 as the activation function. BP with batch updates and competitive learning is employed for the training. The range for the initial weights is initially set to [ 1, +1]. The total amount of training samples available for this system is 818. A simple diagram of the architecture is shown at Figure 15 44
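Putting these choices together, the preprocessing of one experiment could look roughly like the following sketch (the DO and ORP arrays are synthetic placeholders for the real sensor logs; the filters are the ones defined in section 3.2):

```python
import numpy as np

def exp_smooth(x, alpha):
    out = np.empty(len(x)); out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

SAMPLES_PER_HOUR = 12               # one analog measurement every 5 minutes

# Synthetic signals standing in for one experiment's DO (%) and ORP (mV) logs.
rng = np.random.default_rng(4)
hours = np.arange(0.0, 35.0, 1.0 / SAMPLES_PER_HOUR)
do = np.clip(30 + 10 * np.sin(hours / 5) + rng.normal(0, 2, hours.size), 0, None)
orp = 150 + 0.5 * hours + rng.normal(0, 2, hours.size)

# 1) Discard the first 10 hours (initial stabilization, high DO and early ORP raise).
keep = hours >= 10.0
do, orp = do[keep], orp[keep]

# 2) DO is used as an absolute value, scaled by a factor of 50 into roughly [0, 1].
do_feature = do / 50.0

# 3) ORP is used differentially: exponential smoothing with alpha = 0.05,
#    then a differential with gap g = 1, scaled by a factor of 5 mV.
orp_smooth = exp_smooth(orp, alpha=0.05)
d_orp = np.zeros_like(orp_smooth)
d_orp[1:] = orp_smooth[1:] - orp_smooth[:-1]
orp_feature = d_orp / 5.0

X = np.column_stack([do_feature, orp_feature])   # inputs of the first prototype
print(X.shape)
```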

64 I. DISSERTATION 4. DESCRIPTION OF THE WORK Figure 15. First prototype s architecture. The number of neurons in the hidden layer can be arbitrarily changed since it is a system hyper-parameter. Training 1 Training 2 Training 3 Hidden Neurons Learning Rate Number of Epochs MSE Table 2. Hyper-Parameters and results for the first trainings Training the system There has been a long process of trial and error through which a lot of knowledge about the problem at issue was obtained and proper adaptations were applied to the system. The modifications made to the system are displayed in chronological order along with the consequences that they had to the results. Initial system The first exploratory systems were run with the parameters displayed in Table 2. In spite of the small variations in the hyper-parameters, the results for the training did not change much. By looking a the outputs (Figure 16), it can be seen that all the peaks in the desired output are successfully followed by the ones of the system. However, there is a dangerous false trigger around the 22 nd hour of EXP5. Although a threshold with a value of 0.85 would have led to successful results, the margin is really narrow and small noise in the inputs could easily provoke a bad trigger or missing one of the good trigger. Figure 16. Outputs of initial system. 45

65 I. DISSERTATION 4. DESCRIPTION OF THE WORK Training Training 2 Training 3 EXP3 trigger EXP4 trigger EXP5 trigger EXP5 false trigger Margin Table 3. PI for first trainings. The margin is the difference between the lower good trigger and the false trigger of EXP5 It seems that the MSE is not a good index for evaluating the quality of the results, since small deviations from the desired output have no impact in the quality of the performance. For this reason, the margin between the higher false output peak and the lower correct output peak is set as the performance index (PI). The values of the PI of the first training instances are displayed in Table 3. Although the first one has slighly a higher margin, all of them have an inadmissible value. A margin above 0.3 would be reasonable. Reducing irrelevant data for improving performance in key points In order to improve the results in the critical points (e.g., the good triggers and the false one), data from the irrelevant parts is withdrawn from the training set. By doing this, the training algorithm focuses on decreasing the errors at the critical points, since it does not get distracted by the other data. For doing so, first only the data starting at the 20 th hour from EXP4 is taken, and then only the data starting at the 25 th hour from the same experiment. Both trainings keep the same parameters as Training 3 (see Table 2). The output of these two trainings did not improve with respect the previous ones, with margins of and 0.154, respectively. In order to reach a PI above 0.30 other methods had to be used. For instance, the ph was left behind when choosing the features for the system and it seemed to have some valuable information. Another idea for improving the performance came from the fact that the system may be concentrating too much in bring to 0 the output at the flat areas instead of minimizing the critical points. For improving this, the desired output could have slightly higher values at the parts of the flat parts where the input variables approach the triggering conditions. There were other ideas for improving the system (such as changing the window for the differential ORP, include differential DO or adding a second hidden layer to the MLP), but they were not deeply considered yet. Adding the ph and altering the output For the next version of the system prototype, there was a new set data available: EXP6. This experiment is characterized by a rising ph by the end, a high value DO for the las half of it and a very flat ORP. Although the different ORP behavior may be problematic, both the DO and the ph comply well with the other experiments behavior. Another special feature about this experiment is the big drop-down of the ph around the 23 th hour, due to an error of the controlling system. This experiment is included in the validation set together with EXP2. 46
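The computation of the performance index introduced above is straightforward once the output peaks of each experiment are known; the peak values in the following sketch are hypothetical and serve only to illustrate the calculation:

```python
def performance_index(good_trigger_peaks, false_trigger_peaks):
    # Margin between the lowest output peak at a correct trigger point
    # and the highest output peak at a false trigger.
    return min(good_trigger_peaks) - max(false_trigger_peaks)

# Hypothetical peak values for one training run (one correct trigger per experiment).
good_peaks = {"EXP3": 0.95, "EXP4": 0.90, "EXP5": 0.88}
false_peaks = {"EXP5, 22nd hour": 0.72}

print("PI =", round(performance_index(good_peaks.values(), false_peaks.values()), 3))
# A value above 0.30 is considered an acceptable margin.
```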

66 I. DISSERTATION 4. DESCRIPTION OF THE WORK Figure 17. New desired output with no flat parts. The ph has a raise up to 5.60 around the start feeding point. This occurs in all experiments except from EXP3 where there it stays around The reason for this is rooted on the growth condition, which affects the behavior of the acidity. No other points approach the 5.6 line, so it seems taking the absolute value of the ph as a feature should work well. In order to smooth a little bit the noise around the low levels of ph, an exponential smoothing filter with α = 0.2 is applied to the variable. Most interesting data occurs in the range [5.40, 5.60], so an offset of 5.4 and a factor of 0.2 are used for scaling, with the interesting data then staying in the range [0, 1]. Regarding crafting the new desired output, the flat-zero parts are raised up to three times according to the following criteria: 1. If the ph input is above 5.45, raise output by 0.1. If ph is above 5.525, raise it by 0.2 instead. 2. If the DO input is above 15%, raise output by 0.1. If it is above 30%, raise it by 0.2 instead. 3. If the differential ORP input is above 0.5mV, raise output by 0.1. If it is above 1.5mV, raise it by 0.2 instead. The results of applying these rules can be seen in Figure 17. The hyper-parameters used fo this training are the same as the ones employed for Training 3, except that the learning rate µ is lowered to The results of the training are quite positive, as the PI improves from to 0.251, although it has not reached 0.30 yet. Since there is a more acceptable performance in the training set, the system is also evaluated in the validation set. While results for EXP2 are quite positive really low output during the whole experiment and a steep raise up to 0.9 at the very end, results for EXP6 are not that good. As it is shown in Figure 18, there is a false trigger around the 18 th hour that raises above the true trigger at the end of the experiment. It seems that the right and false triggers from EXP2 and EXP6 should also be added to the PI, as it is shown in Table 4. There is something in common with both false triggers at EXP5 and EXP6: they have a variable with a value out of the common range. In the case of EXP5, the false trigger has a differential ORP around 7mV when the two next higher peaks are at 5mV and 3mV. In EXP6, the false trigger has a DO peak close to 80% when most DO value do not overpass 50%. This observation suggested the next modification of the AI system. 47
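A direct translation of these three rules could look as follows (the ph, DO and differential ORP samples are placeholder values):

```python
import numpy as np

def raise_flat_output(base_output, ph, do, d_orp):
    # Raise the flat (zero) parts of the desired output when the inputs
    # approach the triggering conditions, following the three rules above.
    out = np.array(base_output, dtype=float)
    out += np.where(ph > 5.525, 0.2, np.where(ph > 5.45, 0.1, 0.0))
    out += np.where(do > 30.0, 0.2, np.where(do > 15.0, 0.1, 0.0))
    out += np.where(d_orp > 1.5, 0.2, np.where(d_orp > 0.5, 0.1, 0.0))
    return out

# Placeholder samples for three time points far from the feeding point.
ph = np.array([5.40, 5.47, 5.55])
do = np.array([10.0, 20.0, 35.0])      # %
d_orp = np.array([0.2, 0.8, 1.8])      # mV
print(raise_flat_output(np.zeros(3), ph, do, d_orp))   # [0.  0.3  0.6]
```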

67 I. DISSERTATION 4. DESCRIPTION OF THE WORK Figure 18. Results of the system with ph when applied to EXP6. Training 6 EXP2 trigger EXP3 trigger EXP4 trigger EXP5 trigger EXP6 trigger EXP5 false trigger EXP6 false trigger Margin Table 4. Extended PI. Adding saturations to the input variables For helping lowering the impact of the false triggers, saturations are applied to two input variables: the DO is limited to 50% and the differential ORP is limited to 2mV. The scaling factor for the differential ORP is consequently lowered to 2mV as well. The system was trained with the same parameters as in the previous training. The results again significantly improve, as the false trigger of EXP5 gets down from to and the false trigger from EXP6 changes from to On the other hand, the good trigger for EXP6 (which was the lowest one of all the experiments) rises from to Summing up, the PI improves from to 0.260, which is again an acceptable margin value. The threshold should be placed then in the middle of this margin range, which happens to be Optimizing the system: pruning the hidden neurons Once the AI system has proved to work reasonably well for both the training and validation sets, a work of optimization should be carried out in order to push the system to its limits and achieve that 0.30 PI that was set as a goal. The first hyperparameter that was optimized is the number of neurons in the hidden layer (H). A high value of this parameter may lead to a too complex AI system that may incur in over-fitting. For checking which is the minimum acceptable size of the NN, trainings with decreasing H were carried out, ranging from 10 to 1. The other parameters keep the same as the previous training. 48
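These saturations, together with the placement of the threshold at the middle of the margin, could be sketched as follows (a symmetric limit is assumed for the differential ORP, and the two peak values are chosen only to be consistent with the reported margin of 0.260 and threshold of 0.725):

```python
import numpy as np

def saturate_inputs(do, d_orp):
    # DO is limited to 50 % and then scaled into [0, 1];
    # the differential ORP is limited to 2 mV and scaled by that same 2 mV factor.
    do_feat = np.minimum(do, 50.0) / 50.0
    d_orp_feat = np.clip(d_orp, -2.0, 2.0) / 2.0
    return do_feat, d_orp_feat

# Placeholder values containing the kind of out-of-range peaks seen in EXP5 and EXP6.
do = np.array([35.0, 80.0, 48.0])      # % (the 80 % value is an outlier)
d_orp = np.array([1.0, 7.0, 1.8])      # mV (the 7 mV value is the false-trigger peak)
print(saturate_inputs(do, d_orp))

# Threshold placed at the middle of the margin between the lowest good trigger
# and the highest false trigger.
lowest_good, highest_false = 0.855, 0.595
print("threshold =", (lowest_good + highest_false) / 2)
```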
