Extracting failure modes from unstructured, natural language text

Malan, Francina (2020-03)

Thesis (MEng)--Stellenbosch University, 2020.

Thesis

ENGLISH ABSTRACT: This thesis investigates whether text mining (and the related fields of machine learning and natural language processing) can be used to extract useful information, specifically failure modes, from the low-quality, unstructured text records available in industry. Failure data, and particularly information about failure modes, is imperative for good asset management, but frequently goes underutilised because it is buried in unstructured text that is not amenable to traditional analytics yet is too resource-intensive to process manually. While the ideal solution would be to improve the information management system to prevent the collection of such data, this addresses only the quality of future data, while years of historical data would be lost. Several authors have acknowledged the prevalence of text-based maintenance records, identifying both the potential value and the problems in utilising this data, with many suggesting some form of text mining as a possible solution. Within this and related fields, there is a gap between the academic and the industry-focussed literature. This pertains to both the scarcity of industry (and especially maintenance-specific) research and the inadequate attention given to the theoretical basis of these fields in the available industry literature. The biggest concern is the violation of the independent, identically distributed (IID) assumption in maintenance data and the impact this has on the validity of various evaluation schemes. Other concerns relate to the optimisation of preprocessing parameters and the evaluation metric used to assess performance. This project was completed within the CRISP-DM framework. For the research objectives, both the more practical, industry-focussed studies and the more theoretical, academic studies were investigated. In the experimental component, two families of algorithms were evaluated, namely Support Vector Machines and Naïve Bayes.
The focus was on the validity of the modelling and evaluation process, based on problems identified in the literature. Noteworthy aspects of this procedure include using a blocked cross-validation as the outer, evaluation loop of a nested cross-validation, both to account for the IID violation and to prevent the over-optimisation that can result from single-loop cross-validation. The most important contribution of this work is the experimental design, which consolidates multiple validity concerns that are raised in academic literature but receive limited attention in industry. In particular, it addresses the violation of the IID assumption in standard cross-validation (Bergmeir and Benitez, 2012), the importance of including preprocessing in the model optimisation (Krstajic et al., 2014), the high potential of randomised search optimisation (Bergstra and Bengio, 2012) and the different formulations of the cross-validated F-score (Forman and Scholz, 2010). The recommendations made by authors investigating these issues in isolation were combined to form the experimental design. It is, however, worth noting that the methodological conclusions of this study are based on the evaluation of a single dataset and are not necessarily indicative of general behaviour. The project concludes that while text mining offers a viable solution to the identified problem, applying it is not trivial and would require substantial commitment from organisations wishing to utilise their data.
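Two of the ideas named above, blocked folds and the differing formulations of the cross-validated F-score, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`blocked_folds`, `averaged_f1`, `pooled_f1`) are hypothetical, records are assumed to be sorted chronologically, and scoring an empty fold as 0.0 is an assumed convention chosen only for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def blocked_folds(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists where each test fold is one contiguous block.

    Assumes the records are already in chronological order, so nothing is shuffled.
    """
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        # The first `extra` folds receive one additional record each.
        stop = start + base + (1 if k < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(fold_counts: Sequence[Counts]) -> float:
    """Compute F1 separately in every fold, then take the mean."""
    scores = [f1_from_counts(tp, fp, fn) for tp, fp, fn in fold_counts]
    return sum(scores) / len(scores)


def pooled_f1(fold_counts: Sequence[Counts]) -> float:
    """Sum TP, FP and FN over all folds, then compute a single F1."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    return f1_from_counts(tp, fp, fn)
```

With hypothetical per-fold counts such as `[(1, 0, 1), (0, 0, 0), (3, 1, 0)]`, the two formulations give noticeably different values, which is why the choice of formulation needs to be stated when results are reported. In a nested design, `blocked_folds` would supply the outer evaluation splits, with hyperparameter and preprocessing search run only on each outer training portion.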

AFRIKAANS ABSTRACT (English translation): This thesis investigates whether text mining (and the related fields of machine learning and natural language processing) can be used to obtain useful information, specifically failure modes, from the low-quality, unstructured text records available in industry. Failure data, and specifically information about failure modes, is indispensable for good asset management, but often goes unused because it is collected in a text format that is unsuitable for traditional data analysis and too resource-intensive to process by hand. Although the ideal solution would be to improve the information management system so that such data is no longer collected, this would address only the quality of future data, while years of historical data would be lost. Several authors confirm that text-based maintenance records are common and identify both the potential value and the problems of using them, which leads many to propose text mining. Within these fields there is a gap between the academic and the industry-focussed literature. This relates to the scarcity of industry (and especially maintenance-specific) research and the insufficient attention given to the theoretical basis of these fields in the available industry literature. The greatest concern is the violation of the IID assumption in maintenance data and the impact this has on evaluation schemes. Other concerns are the optimisation of preprocessing parameters and the evaluation metric used. This project was carried out within the CRISP-DM framework. Both the more practical, industry-focussed studies and the more theoretical, academic studies were examined. In the experimental component, two classes of algorithms were evaluated: Support Vector Machines and Naïve Bayes.
The focus was on the validity of the modelling and evaluation process, based on problems identified in the literature. Notable aspects of this procedure are the use of blocked cross-validation in the outer evaluation loop of a nested cross-validation, to account for the IID violation and to prevent the over-optimisation that can result from single-loop cross-validation. The most important contribution of this work is the experimental design, which consolidates multiple validity concerns that have been raised in academic literature but receive little attention in industry. In particular, it addresses the violation of the IID assumption in standard cross-validation (Bergmeir and Benitez, 2012), the importance of including preprocessing in the model optimisation (Krstajic et al., 2014), the high potential of randomised search optimisation (Bergstra and Bengio, 2012) and the different formulations of the cross-validated F-score (Forman and Scholz, 2010). The recommendations of authors who investigated these problems in isolation are combined to form the experimental design. It should be noted, however, that the methodological findings of this study are based on the evaluation of a single dataset and are not necessarily indicative of general behaviour. The project finds that although text mining offers a solution to the identified problem, it is not an easy process and requires substantial commitment from organisations that want to use their data.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/107813