The use of classification methods for gross error detection in process data

Gerber, Egardt (2013-12)

Thesis (MScEng)-- Stellenbosch University, 2013.


ENGLISH ABSTRACT: All process measurements contain some element of error. Typically, a distinction is made between random errors, with zero expected value, and gross errors, with non-zero magnitude. Data Reconciliation (DR) and Gross Error Detection (GED) comprise a collection of techniques designed to attenuate measurement errors in process data in order to reduce the effect of the errors on subsequent use of the data. DR proceeds by finding the optimum adjustments so that reconciled measurement data satisfy imposed process constraints, such as material and energy balances. The DR solution is optimal under the assumed statistical random error model, typically Gaussian with zero mean and known covariance. The presence of outliers and gross errors in the measurements or imposed process constraints invalidates the assumptions underlying DR, so that the DR solution may become biased. GED is required to detect, identify and remove or otherwise compensate for the gross errors. Typically, GED relies on formal hypothesis testing of constraint residuals or measurement adjustment-based statistics derived from the assumed random error statistical model. Classification methodologies are methods by which observations are classified as belonging to one of several possible groups. For the GED problem, artificial neural networks (ANNs) have been applied historically to resolve the classification of a data set as either containing or not containing a gross error. The hypothesis investigated in this thesis is that classification methodologies, specifically classification trees (CT) and linear or quadratic classification functions (LCF, QCF), may provide an alternative to the classical GED techniques. This hypothesis is tested via the modelling of a simple steady-state process unit with associated simulated process measurements. DR is performed on the simulated process measurements in order to satisfy one linear and two nonlinear material conservation constraints.
Selected features from the DR procedure and process constraints are incorporated into two separate input vectors for classifier construction. The performance of the classification methodologies developed on each input vector is compared with the classical measurement test in order to address the posed hypothesis. General trends in the results are as follows:
- The power to detect and/or identify a gross error is a strong function of the gross error magnitude as well as location for all the classification methodologies as well as the measurement test.
- For some locations there exist large differences between the power to detect a gross error and the power to identify it correctly. This is consistent over all the classifiers and their associated measurement tests, and indicates significant smearing of gross errors.
- In general, the classification methodologies have higher power for equivalent type I error than the measurement test.
- The measurement test is superior for small magnitude gross errors, and for specific locations, depending on which classification methodology it is compared with.
There is significant scope to extend the work to more complex processes and constraints, including dynamic processes with multiple gross errors in the system. Further investigation into the optimal selection of input vector elements for the classification methodologies is also required.
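
As a rough illustration of the DR and measurement test steps summarised in the abstract, the following minimal Python sketch reconciles measurements against linear constraints only and computes a measurement test statistic for each adjustment. It is not the thesis's simulation: the four-stream network, the unit variances, the measurement values and the 1.96 critical value (5% two-sided, with no correction for multiple tests) are all assumptions made for this example, and the thesis's two nonlinear constraints are not represented.

```python
import math

def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def reconcile(y, sigma2, A):
    """Linear DR with diagonal covariance.

    Minimises sum((y_i - x_i)^2 / sigma2_i) subject to A x = 0 and returns
    the reconciled values and the measurement test statistics.
    """
    m, n = len(A), len(y)
    # Constraint residuals r = A y
    r = [sum(A[k][i] * y[i] for i in range(n)) for k in range(m)]
    # V = A Sigma A^T, the covariance of r under the random error model
    V = [[sum(A[k][i] * sigma2[i] * A[l][i] for i in range(n))
          for l in range(m)] for k in range(m)]
    Vinv = invert(V)
    lam = [sum(Vinv[k][l] * r[l] for l in range(m)) for k in range(m)]
    # Adjustments d = Sigma A^T V^-1 r, reconciled values x_hat = y - d
    d = [sigma2[i] * sum(A[k][i] * lam[k] for k in range(m)) for i in range(n)]
    x_hat = [y[i] - d[i] for i in range(n)]
    # Var(d_i) is the i-th diagonal element of Sigma A^T V^-1 A Sigma
    var_d = [sigma2[i] ** 2 * sum(A[k][i] * Vinv[k][l] * A[l][i]
                                  for k in range(m) for l in range(m))
             for i in range(n)]
    z = [abs(d[i]) / math.sqrt(var_d[i]) for i in range(n)]
    return x_hat, z

# Hypothetical network: stream 1 splits into streams 2 and 3,
# and stream 3 passes through a unit to become stream 4.
A = [[1.0, -1.0, -1.0, 0.0],   # x1 - x2 - x3 = 0
     [0.0, 0.0, 1.0, -1.0]]    # x3 - x4 = 0
sigma2 = [1.0, 1.0, 1.0, 1.0]  # assumed measurement variances
y = [100.4, 59.7, 45.0, 40.2]  # made-up data; stream 3 carries a bias of about +5

x_hat, z = reconcile(y, sigma2, A)
flagged = [i + 1 for i, zi in enumerate(z) if zi > 1.96]  # assumed critical value
suspect = max(range(len(z)), key=lambda i: z[i]) + 1
print("reconciled:", [round(v, 2) for v in x_hat])
print("test statistics:", [round(v, 2) for v in z])
print("flagged streams:", flagged, "largest statistic at stream", suspect)
```

In this made-up case the largest statistic falls on stream 3, but stream 4 also exceeds the critical value even though its measurement carries no bias, which is the kind of smearing the abstract refers to.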

AFRIKAANS ABSTRACT (English translation): All process measurements contain some degree of measurement error. The error component of a process measurement is often expressed as consisting of a random error with zero expected value, together with a non-random error of significant magnitude. Data Reconciliation (DR) and Gross Error Detection (GED) are a collection of techniques whose aim is to reduce the effect of such errors in process data on the subsequent use of the data. DR is performed by making the optimal changes to the original process measurements so that the adjusted measurements obey certain process models, typically mass and energy balances. The DR solution is optimal, provided the statistical assumptions regarding the random error component in the process data are valid. It is typically assumed that the error component is normally distributed, with zero expected value and a given covariance matrix. When non-random errors are present in the data, the results of DR can be biased. GED is therefore needed to find (Detection) and to identify (Identification) non-random errors. GED usually relies on the statistical properties of the measurement adjustments made by the DR procedure, or of the residuals of the model equations, to test formal hypotheses regarding the presence of non-random errors. Classification techniques are used to determine the class membership of observations. For the GED problem, artificial neural networks (ANNs) have historically been applied to solve the Detection and Identification problems. The hypothesis of this thesis is that classification techniques, specifically classification trees (CT) and linear as well as quadratic classification functions (LCF and QCF), can be applied successfully to solve the GED problem. The hypothesis is investigated by means of a simulation of a simple steady-state process unit that is subject to one linear and two nonlinear equations.
Artificial process measurements are created using random numbers so that the error component of each process measurement is known. DR is applied to the artificial data, and the DR results are used to create two different input vectors for the classification techniques. The performance of the classification methods is compared with the measurement test of classical GED in order to answer the posed hypothesis. The underlying trends in the results are as follows:
- The ability to detect and identify a non-random error depends strongly on the magnitude as well as the location of the error, for all the classification techniques as well as the measurement test.
- For certain locations of the non-random error there is a large difference between the ability to detect the error and the ability to identify it, which indicates smearing of the error. All the classification techniques as well as the measurement test exhibit this property.
- In general, the classification methods show greater success than the measurement test.
- The measurement test is more successful for relatively small non-random errors, as well as for certain locations of the non-random error, depending on the classification technique in question.
There are several ways to extend the scope of this investigation. More complex, non-steady-state processes with strongly nonlinear process models and multiple non-random errors can be investigated. The possibility also exists to improve the performance of classification methods through the appropriate choice of input vector elements.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/85856