Development of a Big Data analytics demonstrator

Butler, Rhett Desmond (2018-12)

Thesis (MEng)--Stellenbosch University, 2018.


ENGLISH ABSTRACT: The continued development of the information era has established the term 'Big Data', and large datasets are now easily created and stored. Humanity is beginning to understand the value of data and, more importantly, that valuable insights are captured within it. To uncover these insights and convert them into value, various mathematical and statistical techniques are combined with powerful computing capabilities to perform analytics, a process described by the term 'data science'. Machine learning forms part of data analytics and is based on some of these mathematical techniques. The industrial engineer's ability to integrate systems and to incorporate new technological developments for the benefit of business makes it inevitable that the industrial engineering domain will also become involved in data analytics. The aim of this study was to develop a demonstrator from which the industrial engineering domain can learn and gain first-hand knowledge in order to better understand a Big Data analytics system. This study describes how the demonstrator was developed as a system, which practical obstacles were encountered, and which techniques are currently available to analyse large datasets for new insights. An architecture was developed based on the existing, but somewhat limited, literature, and a hardware implementation was carried out accordingly. For the purposes of this study, three computers were used: the first was configured as the master node and the other two as slave nodes. Software that coordinates and executes the analysis was identified and used to analyse various test datasets available in the public domain. The datasets are in different formats, which require different machine learning techniques. These include, among others, regression under supervised learning and k-means under unsupervised learning. The performance of this system was compared with that of a conventional analytics configuration in which only one computer is used.
The criteria used were 1) the time to analyse a dataset using a given technique and 2) the accuracy of the predictions made by the demonstrator and the conventional system. The results were determined for several datasets. It was found that smaller datasets were analysed faster by the conventional system, but that it could not handle larger datasets. The demonstrator performed very well with larger datasets and with all the machine learning techniques applied to it.
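The abstract names two technique families without specifying the software stack used on the cluster. As a minimal, single-machine sketch only (standard-library Python, not the thesis's distributed implementation), the following illustrates both: a closed-form least-squares line fit (regression, supervised) and Lloyd's algorithm for k-means on one-dimensional data (unsupervised). All function names here are illustrative, not taken from the thesis.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm for 1-D points; returns the k centroids, sorted."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - cents[i]))].append(p)
        # Update step: move each centroid to its cluster mean
        # (keep the old centroid if its cluster emptied out).
        cents = [sum(c) / len(c) if c else cents[i]
                 for i, c in enumerate(clusters)]
    return sorted(cents)

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])               # recovers y = 2x + 1
cents = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)     # two clear clusters
```

In the thesis's setting, the same two steps of k-means (assign points, recompute means) are what a master node distributes across slave nodes for large datasets; this sketch runs them in a single process.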

AFRIKAANSE OPSOMMING (English translation): The continued development of the information era has established the term 'Big Data', and huge datasets are nowadays created and stored with ease. More importantly, humanity is beginning to grasp the value of data, and further, that valuable secrets may lie locked within it. To uncover these secrets and convert them into business value, various mathematical and statistical analysis techniques are combined with powerful computing capability to perform analytics. This activity is described by the term 'data science'. Machine learning is part of data analytics and is based on some of the available mathematical techniques. The industrial engineer's ability to integrate systems and to harness new developments for the benefit of enterprises makes it inevitable that the industrial engineering domain will also become involved in data analytics. The aim of this study was to develop a demonstrator from which the industrial engineering domain can learn and gain first-hand knowledge in order to better understand a Big Data system. This study describes how the demonstrator was developed as a system, which practical obstacles were encountered, and which techniques are currently available to analyse large datasets for value. An architecture was developed based on the existing, but somewhat limited, literature, and a hardware implementation was carried out accordingly. For the purposes of the study, three computers were used: one serving as the master and two as slaves. Software that coordinates and executes the analysis was identified and used to analyse various test datasets available in the public domain. The datasets are in different formats, which require different machine learning techniques. These include, among others, regression under supervised learning and k-means under unsupervised learning. The performance of the system was compared with that of a conventional configuration in which only one computer was used.
The criteria used were 1) the time to analyse a dataset with a given technique and 2) the accuracy of the predictions made by the demonstrator and the conventional system. The results were determined for several datasets, and it was found that smaller datasets are analysed faster by the conventional system, but that it cannot handle large datasets. The demonstrator performed very well with large datasets and with all the machine learning techniques applied to it.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/104869