On noise regularised neural networks: initialisation, learning and inference

Pretorius, Arnu (2019-12)

Thesis (PhD)--Stellenbosch University, 2019.

ENGLISH ABSTRACT: Innovation in regularisation techniques for deep neural networks has been a key factor in the rising success of deep learning. However, there is often limited guidance from theory in the development of these techniques, and our understanding of how various successful regularisation techniques function remains impoverished. In this work, we seek to contribute to an improved understanding of regularisation in deep learning. We specifically focus on a particular approach to regularisation that injects noise into a neural network. A widely used example of such a technique is dropout (Srivastava et al., 2014). Our contributions in noise regularisation span three key areas of modelling: (1) learning, (2) initialisation and (3) inference. We first analyse the learning dynamics of a simple class of shallow noise regularised neural networks called denoising autoencoders (DAEs) (Vincent et al., 2008), to gain an improved understanding of how noise affects the learning process. In this first part, we observe a dependence of learning behaviour on initialisation, which leads us to study how noise interacts with the initialisation of a deep neural network in terms of signal propagation dynamics during the forward and backward pass. Finally, we consider how noise affects inference in a Bayesian context. We mainly focus on fully-connected feedforward neural networks with rectified linear unit (ReLU) activation functions throughout this study. To analyse the learning dynamics of DAEs, we derive closed-form solutions to a system of decoupled differential equations that describe the change in scalar weights during the course of training as they approach the eigenvalues of the input covariance matrix (under a convenient change of basis).
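The decoupled-mode picture can be illustrated with a toy gradient flow. As a hedged sketch (the thesis derives the exact noise-dependent ODEs, which differ from this simplified form), assume each scalar weight relaxes independently toward an eigenvalue λ_i of the input covariance via dw_i/dt = λ_i − w_i, which has the closed-form solution w_i(t) = λ_i + (w_i(0) − λ_i)e^{−t}:

```python
import numpy as np

def mode_trajectory(lam, w0, t):
    """Closed-form solution of the toy decoupled ODE dw/dt = lam - w."""
    return lam + (w0 - lam) * np.exp(-t)

def euler_trajectory(lam, w0, t_max, dt=1e-3):
    """Forward-Euler integration of the same ODE, for comparison."""
    w = w0
    for _ in range(int(t_max / dt)):
        w += dt * (lam - w)
    return w

# Eigenvalues of a toy input covariance, and small random initial weights.
rng = np.random.default_rng(0)
lams = np.array([3.0, 1.5, 0.2])
w0 = rng.normal(scale=1e-2, size=3)

closed = mode_trajectory(lams, w0, t=10.0)
numeric = np.array([euler_trajectory(l, w, 10.0) for l, w in zip(lams, w0)])

# Each mode approaches its eigenvalue; Euler agrees with the closed form.
print(np.round(closed, 3))  # approximately [3.0, 1.5, 0.2]
```

The point of the decoupling is visible here: each mode can be solved in isolation, so the whole learning trajectory is known in closed form once the input covariance is diagonalised.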
In terms of initialisation, we use mean field theory to approximate the distribution of the pre-activations of individual neurons, and use this to derive recursive equations that characterise the signal propagation behaviour of the noise regularised network during the first forward and backward pass of training. Using these equations, we derive new initialisation schemes for noise regularised neural networks that ensure stable signal propagation. Since this analysis is only valid at initialisation, we next conduct a large-scale controlled experiment, training thousands of networks under a theoretically guided experimental design, to further test the effects of initialisation on training speed and generalisation. To shed light on the influence of noise on inference, we develop a connection between randomly initialised deep noise regularised neural networks and Gaussian processes (GPs)—non-parametric models that perform exact Bayesian inference—and establish new connections between a particular initialisation of such a network and the behaviour of its corresponding GP. Our work ends with an application of signal propagation theory to approximate Bayesian inference in deep learning, where we develop a new technique that uses self-stabilising priors for training deep Bayesian neural networks (BNNs). Our core findings are as follows: noise regularisation helps a model to focus on the more prominent statistical regularities in the training data distribution during learning, which should be useful for later generalisation. However, if the network is deep and not properly initialised, noise can push network signal propagation dynamics into regimes of poor stability. We correct this behaviour with proper "noise-aware" weight initialisation.
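The effect such a noise-aware scheme corrects can be simulated directly. As an illustrative sketch (our own, not the thesis code), consider ReLU layers with inverted dropout at keep probability p: under a mean-field variance map, standard He initialisation (weight variance 2/fan_in) lets the pre-activation second moment grow by roughly 1/p per layer, while rescaling the variance to 2p/fan_in keeps it fixed, consistent with the noise-aware initialisation the abstract refers to:

```python
import numpy as np

def signal_variance(depth, width, p, sigma_w2, seed=0):
    """Propagate one input through a deep ReLU net with inverted dropout
    and return the pre-activation second moment at every layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    moments = []
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(sigma_w2 / width), size=(width, width))
        mask = rng.binomial(1, p, size=width) / p  # inverted dropout mask
        h = W @ (np.maximum(h, 0.0) * mask)
        moments.append(np.mean(h ** 2))
    return np.array(moments)

p, depth, width = 0.6, 50, 500
he = signal_variance(depth, width, p, sigma_w2=2.0)        # standard He init
aware = signal_variance(depth, width, p, sigma_w2=2.0 * p)  # noise-aware init

# He init: second moment grows ~ (1/p)^depth and explodes at depth 50;
# noise-aware init: second moment stays of order one.
print(he[-1], aware[-1])
```

With p = 0.6, the naive scheme multiplies the signal's second moment by about 1/p ≈ 1.67 per layer, a factor of roughly 10^11 over 50 layers, which is exactly the instability the stable initialisation removes.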
Even with this correction, noise limits the depth to which networks are able to train successfully, and networks that do not exceed this depth limit demonstrate a surprising insensitivity to initialisation with regard to training speed and generalisation. In terms of inference, noisy neural network GPs perform best when their kernel parameters correspond to the new initialisation derived for noise regularised networks, and increasing the amount of injected noise leads to more constrained (simple) models with larger uncertainty (away from the training data). Lastly, we find that our new technique using self-stabilising priors makes training deep BNNs more robust and leads to improved performance when compared to other state-of-the-art approaches.
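The network–GP correspondence invoked here can be made concrete with the standard NNGP kernel recursion for ReLU networks (the arc-cosine kernel of Cho and Saul, composed layer by layer as in the NNGP literature). The sketch below implements only this noiseless recursion; the thesis's extension to noise-regularised networks modifies it and is not reproduced here:

```python
import numpy as np

def relu_nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Compose the ReLU (arc-cosine) NNGP kernel recursion over `depth`
    hidden layers.  Base case: K = sigma_w2 * X X^T / d + sigma_b2.
    Step:  K(x,x') = sigma_w2 * sqrt(Kxx * Kx'x')
                     * (sin(theta) + (pi - theta) * cos(theta)) / (2*pi)
                     + sigma_b2,  with theta the angle between x and x'."""
    d = X.shape[1]
    K = sigma_w2 * (X @ X.T) / d + sigma_b2
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        cos_t = np.clip(K / np.outer(diag, diag), -1.0, 1.0)
        theta = np.arccos(cos_t)
        J = np.sin(theta) + (np.pi - theta) * np.cos(theta)
        K = sigma_w2 * np.outer(diag, diag) * J / (2 * np.pi) + sigma_b2
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
K = relu_nngp_kernel(X, depth=3)
print(np.allclose(K, K.T))  # the resulting kernel matrix is symmetric
```

Note that with sigma_w2 = 2 (the critical ReLU choice), the diagonal of K is preserved across layers, mirroring the stable-signal-propagation condition: the kernel parameters that keep the GP well behaved are the same ones the initialisation analysis singles out.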

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/107035