Development of a Big Data analytics demonstrator

dc.contributor.advisor: Bekker, James [en_ZA]
dc.contributor.author: Butler, Rhett Desmond [en_ZA]
dc.contributor.other: Stellenbosch University. Faculty of Engineering. Dept. of Industrial Engineering. [en_ZA]
dc.date.accessioned: 2018-11-15T08:32:35Z
dc.date.accessioned: 2018-12-07T06:48:29Z
dc.date.available: 2018-11-15T08:32:35Z
dc.date.available: 2018-12-07T06:48:29Z
dc.date.issued: 2018-12
dc.identifier.uri: http://hdl.handle.net/10019.1/104869
dc.description: Thesis (MEng)--Stellenbosch University, 2018. [en_ZA]
dc.description.abstract: ENGLISH ABSTRACT: The continued development of the information era has established the term 'Big Data', and large datasets are now easily created and stored. Humanity is beginning to understand the value of data and, more importantly, that valuable insights are captured within data. To uncover these insights and convert them into value, various mathematical and statistical techniques are combined with powerful computing capabilities to perform analytics. This process is described by the term 'data science'. Machine learning is part of data analytics and is based on some of the available mathematical techniques. The ability of the industrial engineer to integrate systems and to incorporate new technological developments that benefit business makes it inevitable that the industrial engineering domain will also be involved in data analytics. The aim of this study was to develop a demonstrator from which the industrial engineering domain can learn and gain first-hand knowledge, in order to better understand a Big Data analytics system. This study describes how the demonstrator was developed as a system, which practical obstacles were encountered, and which techniques are currently available to analyse large datasets for new insights. An architecture was developed based on existing but somewhat limited literature, and a hardware implementation was done accordingly. For the purpose of this study, three computers were used: the first was configured as the master node and the other two as slave nodes. Software that coordinates and executes the analysis was identified and used to analyse various test datasets available in the public domain. The datasets are in different formats, which require different machine learning techniques. These include, among others, regression under supervised learning and k-means under unsupervised learning.
The performance of this system was compared with a conventional analytics configuration, in which only one computer is used. The criteria used were 1) the time to analyse a dataset using a given technique and 2) the accuracy of the predictions made by the demonstrator and the conventional system. The results were determined for several datasets, and it was found that smaller datasets were analysed faster by the conventional system, but that it could not handle larger datasets. The demonstrator performed very well with larger datasets and with all the machine learning techniques applied to it. [en_ZA]
dc.description.abstract: AFRIKAANS SUMMARY (English translation): The continued development of the information era has established the term 'Big Data', and huge datasets are nowadays created and stored with ease. More important is that humanity is beginning to grasp the value of data and, moreover, that valuable secrets may lie locked within data. To uncover these secrets and convert them into business value, various mathematical and statistical analysis techniques are combined with powerful computing capability to perform analyses. This activity is described by the term 'data science'. Machine learning is part of data analytics and is based on some of the available mathematical techniques. The industrial engineer's ability to integrate systems and to harness new developments for the benefit of enterprises makes it inevitable that the industrial engineering domain will also become involved in data analytics. The aim of this study was to develop a demonstrator so that the industrial engineering domain can learn from it and gain first-hand knowledge in order to better understand a Big Data system. This study describes how the demonstrator was developed as a system, which practical obstacles were encountered, and the techniques currently available to analyse large datasets for value. An architecture was developed based on existing but somewhat limited literature, and a hardware implementation was done accordingly. For the purpose of the study, three computers were used: one serving as the master and two as slaves. Software that coordinates and executes the analysis was identified and used to analyse various test datasets available in the public domain. The datasets are in different formats, which require different machine learning techniques. These include, among others, regression under supervised learning and k-means under unsupervised learning. The performance of the system was compared with a conventional configuration in which only one computer was used.
The criteria used were 1) the time to analyse a dataset with a given technique and 2) the accuracy of the predictions made by the demonstrator and the conventional system. The results were determined for several datasets, and it was found that smaller datasets are analysed faster by the conventional system, but that it cannot handle large datasets. The demonstrator performed very well with large datasets and all the machine learning techniques applied to it. [en_ZA]
dc.format.extent: 288 pages : illustrations [en_ZA]
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Stellenbosch : Stellenbosch University [en_ZA]
dc.subject: Industrial Engineering [en_ZA]
dc.subject: UCTD [en_ZA]
dc.subject: Big data -- Analysis [en_ZA]
dc.subject: Machine learning [en_ZA]
dc.subject: Web analytics [en_ZA]
dc.title: Development of a Big Data analytics demonstrator [en_ZA]
dc.type: Thesis [en_ZA]
dc.rights.holder: Stellenbosch University [en_ZA]
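
Illustrative note: the abstract names two evaluation criteria, the time taken to analyse a dataset with a given technique and the accuracy of the resulting predictions. The sketch below shows one way such a measurement could be taken on a single machine for simple linear regression. It is not taken from the thesis: the function names (fit_line, rmse, evaluate), the synthetic dataset, and the choice of root-mean-square error as the accuracy measure are all assumptions made for illustration.

```python
import random
import time


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def rmse(xs, ys, a, b):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    squared = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (squared / len(xs)) ** 0.5


def evaluate(xs, ys):
    """Record both criteria: wall-clock fitting time and prediction error."""
    start = time.perf_counter()
    a, b = fit_line(xs, ys)
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "rmse": rmse(xs, ys, a, b),
            "intercept": a, "slope": b}


# Hypothetical synthetic dataset: y = 2 + 3x plus Gaussian noise.
rng = random.Random(0)
xs = [float(i) for i in range(1000)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

result = evaluate(xs, ys)
print(result)
```

Running the same evaluate call against the same dataset on two different configurations would give one comparable time and error pair per configuration, which is the kind of comparison the abstract describes.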

