Evaluation of statistical analyses for the identification of surrogates and indicators using historical plant data from a water reclamation plant

Coomans, Cornelius Johannes (2017-03)

Thesis (MEng)--Stellenbosch University, 2017.

Thesis

ENGLISH SUMMARY: The lag time associated with water quality monitoring at water reclamation plants (WRPs) is a major hurdle to implementing potable water reclamation in areas suffering from water shortages. The application of advanced monitoring techniques, which rely in part on surrogate and indicator variables, is one way of reducing this lag time. The aim of this study was to evaluate statistical analyses that could be used to identify variable relationships, which in turn could be used for the development of surrogate and indicator variables, following the data-driven approach. The plant data used in this study were obtained from an existing WRP that has been operational for more than five years without undergoing any major changes to the treatment and operational procedures. An initial assessment found that the data contained a large number of missing values. The assessment also identified the periods during which the plant was operating under ‘normal’ conditions. Several time periods were removed because abnormal events occurred during them. Pre-processing the data consisted of outlier removal (three sigma rule and Hampel filter), noise reduction (moving average filter) and missing data replacement (linear interpolation). The statistical analyses, Pearson’s and Spearman’s correlation, principal component analysis (PCA), linear discriminant analysis (LDA) and partial least squares (PLS) regression, were then incorporated into models for identifying variable relationships. The performance of the different statistical analyses was measured using statistical metrics such as R² for correlation, visualisation of separation for PCA, classification error for LDA and both R² and mean squared error (MSE) for the PLS models.
The bivariate correlations provided the most concise results, whilst the LDA models could not be effectively assessed due to a change in the behaviour of the training and testing data. The PLS models performed poorly and did not produce any significant results. Expert process knowledge was also used to determine which variable relationships, identified by the models, could be regarded as valuable contributions, and which ought to be regarded as trivial. Overall it was found that the bivariate correlations were effective for detecting relationships between variables. PCA was a valuable tool that provided insight into the potential use of multivariate analyses. LDA and PLS regression may require further testing before a definitive ruling can be made regarding their usefulness for identifying variable relationships from unprocessed historical plant data. Although historical data could be used to identify variable relationships using bivariate correlations, they are not recommended for multivariate statistical analyses. A planned sampling campaign could be much more effective for data collection than using historical data, although the cost associated with such a campaign must be taken into consideration.
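
The pre-processing chain named in the summary (three sigma rule, Hampel filter, moving average filter, linear interpolation) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the window sizes, the thresholds, the order of the steps and the sample readings are all assumptions made for illustration, and missing values are represented here as `None`.

```python
from statistics import mean, median, stdev


def three_sigma(values):
    """Mark values further than 3 standard deviations from the mean as missing."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return list(values)
    mu, sd = mean(present), stdev(present)
    return [None if v is not None and abs(v - mu) > 3 * sd else v for v in values]


def hampel(values, half_width=2, n_sigmas=3.0):
    """Mark values far from the local median (scaled MAD) as missing."""
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            continue
        lo, hi = max(0, i - half_width), i + half_width + 1
        window = [w for w in values[lo:hi] if w is not None]
        med = median(window)
        # 1.4826 scales the MAD to the standard deviation for normal data.
        mad = 1.4826 * median([abs(w - med) for w in window])
        if abs(v - med) > n_sigmas * mad:
            out[i] = None
    return out


def moving_average(values, half_width=1):
    """Smooth each present value with the mean of present values in its window."""
    out = []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)  # left for the interpolation step
            continue
        lo, hi = max(0, i - half_width), i + half_width + 1
        out.append(mean(w for w in values[lo:hi] if w is not None))
    return out


def interpolate(values):
    """Fill interior gaps linearly; leading or trailing gaps stay missing."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


def preprocess(values):
    """Assumed order: outlier removal, noise reduction, gap filling."""
    return interpolate(moving_average(hampel(three_sigma(values))))


# Hypothetical readings: two gaps and one spike at index 4.
readings = [10.0, 10.2, None, 10.1, 50.0, 9.9, 10.0, 10.3, None, 10.1]
cleaned = preprocess(readings)
```

In this made-up series the single spike inflates the global standard deviation enough that the three sigma rule alone does not flag it, while the window-based Hampel filter does, which is one reason the two checks are often combined.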

AFRIKAANS OPSOMMING: Die tydsverloop wat verband hou met watergehaltemonitering by waterherwinningswerke (WHW’s) is ʼn groot hindernis vir die implementering van drinkbarewaterherwinning in gebiede wat onder watertekorte gebuk gaan. Die toepassing van gevorderde moniteringstegnieke wat gedeeltelik staatmaak op surrogaat- en aanwyserveranderlikes is een manier om hierdie tydsverloop te verminder. Die doel van hierdie studie was om statistiese ontledings te evalueer wat gebruik kan word om veranderlike verhoudings, wat aangewend kan word vir die ontwikkeling van surrogaat- en aanwyserveranderlikes, op grond van die data-gedrewe benadering te identifiseer. Die aanlegdata wat vir hierdie navorsing gebruik is, verkry vanaf ʼn bestaande WHW wat reeds vir vyf jaar werksaam is sonder dat enige groot veranderinge aan behandelings en bedryfsprosedures ondergaan is. Deur ʼn aanvanklike assessering van die data is bevind dat die data groot hoeveelhede ontbrekende waardes bevat. Met die assessering is datatydperke ook geïdentifiseer waartydens die aanleg onder ‘normale’ omstandighede bedryf is. Verskeie tydperke is verwyder aangesien abnormale gebeure daartydens plaasgevind het. Voorverwerking van die data het begin met uitskieterverwydering (driesigma-reël en Hampel-filter), geraasvermindering (bewegendegemiddelde-filter) en ontbrekendedata-vervanging (lineêre interpolasie). Die statistiese ontledings, Pearson en Spearman se korrelasie, hoofkomponentontleding (PCA), lineêre diskriminantontleding (LDA) en gedeeltelike kleinste kwadrate- (PLS-)regressie is in modelle gebruik vir die identifisering van veranderlike verhoudings. Die prestasie van die statistiese ontledings is gemeet met behulp van statistiese maatstawwe soos R2 vir korrelasie, visualisering van skeiding vir PCA, klassifikasiefout vir LDA en sowel R2 as gemiddelde kwadraatfout vir die PLS-modelle. 
Die tweeveranderlike korrelasies het die bondigste resultate getoon, terwyl die LDA-modelle nie doeltreffend beoordeel kon word nie as gevolg van ʼn verandering in die gedrag van die opleiding- en toetsdata. Die PLS-modelle het swak presteer en het nie enige noemenswaardige resultate gelewer nie. Deskundige proseskennis is ook gebruik om te bepaal watter veranderlike verhoudings, wat deur die modelle geïdentifiseer is, as waardevolle bydraes beskou kon word, en watter as onbeduidend beskou behoort te word. In die algemeen is bevind dat die tweeveranderlike korrelasies doeltreffend was vir die identifisering van verwantskappe tussen veranderlikes. PCA was ʼn waardevolle instrument wat insig verskaf het in die potensiële gebruik van meerveranderlike ontledingstegnieke. LDA- en PLS-regressie vereis moontlik verdere toetsing voordat ʼn finale beslissing gemaak kan word met betrekking tot die nut daarvan vir die identifisering van veranderlike verhoudings deur gebruik te maak van onverwerkte historiese aanlegdata. Hoewel historiese data gebruik kon word om veranderlike verhoudings met behulp van tweeveranderlike korrelasies te identifiseer, word dit nie aanbeveel vir meerveranderlike statistiese ontledings nie. ʼn Beplande steekproefnemingsveldtog kan baie doeltreffender wees vir data-insameling as die gebruik van historiese data, hoewel die koste wat verband hou met ʼn beplande steekproefnemingsveldtog in ag geneem moet word.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/101158