Exploratory data analysis and empirical modelling of stationary processes by use of genetic programming

Chemaly, Timothy Paul (1999-12)

Thesis (M.Ing.)--University of Stellenbosch, 1999.

Thesis

ENGLISH SUMMARY: Enhancing the performance of any process requires a detailed knowledge of the unknown system, with a mathematical model being the most common means of representing this knowledge. The most frequently used statistical techniques, assume that any relationships between input and output variables are linear and that the data itself is normally distributed. However, real world systems can be highly non-linear and linear approaches can therefore fail to predict the behaviour of the system accurately. Explicit specification of optimal structure in large non-linear models is often not practical and as a result, non-parametric methods (kernel regression, artificial neural networks, etc.) are usually employed. Although these models allow accurate representation of complex systems, they can be very difficult to interpret. This research project explores a novel approach to this problem of mathematical modelling which attempts to evolve optimal parametric models, based on the Darwinian mechanism of evolution. This approach, referred to as genetic programming (GP), facilitates development of explicit or implicit models, or any mix of these two extremes, as dictated by the problem and unlike other methods, it can handle a trade-off between accuracy and interpretability with great ease. During this research; a -commercial application (a-GP) was developed, since very few commercial systems are currently available. Some techniques were developed, which improved the performance ofthe original algorithm considerably. For instance, memory demands were decreased by a factor of 5 by utilizing a different implementation model. Improved convergence and robustness was obtained by using a correlation-based fitness function in conjunction with a correction filter which reduced the sum of the squared errors; at the expense of a more complex model. The evaluation process was expedited by evaluating each tree-like structure as a reverse polish expression; as opposed to a branch-node reduction technique. Additional execution speed was further obtained by implementing the algorithm in c++ (an object oriented compiled language) which is significantly faster than the original LISP (an interpreted language) implementation, . The newly improved algorithm, a-GP, was applied to four industrial data sets and the results were compared against other methods such as standard genetic programming, multilayer perceptron neural networks and linear regression. It was found that a-GP outperformed standard genetic programming on all four case studies, while improving on neural networks on half of the runs. The evolved models tended to be complex. This could be attributed to the lack of parameter estimation that the genetic programming algorithm tried to compensate for by evolving complex tree structures; which it used to approximate the parameters. As a data visualization tool, a-GP was applied to four bench marking data sets used extensively in the literature. The results acquired with a-GP compared favourably with those obtained by other methods with the additional benefit in that a-GP was able to evolve simple mapping functions, which clearly indicated how the variables related to the structure. Additionally, the algorithm was applied in the mapping of two industrial processes. The results showed distinct clustering tendencies within the data, indicating the different operating regimes of the processes under investigation.

AFRIKAANSE OPSOMMING: Om die vermoe van 'n proses stelsel te verbeter vereis 'n gedetaileerde kennis of model van die onderliggende proses. Die statistiese tegnieke wat meestal gebruik word, neem aan dat enige verwantskap tussen die intree- en die uittree-veranderlikes lineer is en dat die data self normaal versprei is. Ongelukkig is realistiese probleme dikwels nie-linieer en linieere benaderings kan nie die gedrag van sulke stelsels akkuraat karakteriseer nie. Die eksplisiete spesifikasie van die optimale struktuur in groot nie-linieere modelle is nie altyd prakties moontlik nie, met die gevolg dat nieparametriese metodes (basisfunksie-regressie, kunsmatige neurale netwerke, ens.) gebruik word. Alhoewel hierdie modelle akkurate voorstellings toelaat van komplekse sisteeme is hulle baie moeilik om te interpreteer. Hierdie tesis beskryf 'n unieke benadering tot die probleem van wiskundige modellering deur optimale parametriese modelle te evolueer, wat op die basiese beginsels van Darwin se evolusionere model berus. Hierdie tegniek, wat genetiese programmering (GP) heet, kan eksplisiete, sowel as implisiete modelle ontwikkel of enige kombinasie daarvan, soos gedikteer word deur die probleem. Anders as ander metodes, kan dit ook maklik 'n balans tussen akkuraatheid en interpreteerbaarheid handhaaf. Gedurende hierdie navorsing is 'n kommersiele sagteware-pakket (a-GP) ontwikkel, aangesien baie min sulke pakkette tans beskikbaar is. Verskeie tegnieke is ontwikkel wat die standaard algoritme aansienlik verbeter het. By voorbeeld, die aanvraag na geheue is verminder met 'n faktor van 5 deur gebruik te maak van 'n alternatiewe implementeringsmodel. Versnelde konvergensie en robuustheid was ook verkry deur gebruik te maak van 'n korrelasie-gebaseerde fiksheidfunksie in samewerking met 'n korreksiefilter wat die som van die gekwadreerde foute geminimeer het. Die evaluasieproses was ook versnel deur elke boomstruktuur as 'n omgekeerde Pooise vergelyking op te los, in plaas van 'n tak-node vereenvoudigingstegniek. Die verwerking was verder versnel deur die algoritme te implementeer in C++ ('n objek georienteerde, gekompileerde taal) wat aansienlik vinniger is as die oorspronklike LISP (ge"interpreteerde taal) implementering. Die nuwe verbeterde algoritme, a-GP, is toegepas op vier realistiese probleme met die doel om regressiemodelle te genereer en die resultate is vergelyk met die verkry deur ander tegnieke, 5005 standaard genetiese programmering, kunsmatige neurale netwerke en linieere regressie. Daar was gevind dat a-GP op die standaard genetiese programmering verbeter in al vier gevalle, terwyl dit op kunsmatige neurale netwerke verbeter het op een van die toetsstelle. Die mode lie het geneig om kompleks te wees, wat interpretasie bemoeilik het. Dit kan toegeskryf word aan die tekortkoming van 'n parameterbenadering, waarvoor die genetiese programmering algoritme probeer kompenseer deur komplekse boomstukture te ontwikkel. Die algoritme gebruik die strukture om die parameters af te skat. a-GP is ook gebruik om data te visualiseer. Die resultate op vier datastelle het gewys dat a-GP baie goed vergelyk met ander metodes, terwyl dit die addisionele voordeel gehad het, dat dit eenvoudige projeksie-funksies kon evolueer wat duidelike verwantskappe tussen die veranderlikes en die struktuur uitgewys het. Die algoritme was ook toegepas op die projeksie van twee industrieele stelsels na twee dimensies vir visualisering. Die resultate het duidelike trosvorming in die data uitgewys, wat 'n indikasie was van die verskillende operasionele toestande van die-prosesse.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/51267
This item appears in the following collections: