A framework for quantifying and characterising road accident risk : a data mining approach

Van Heerden, Shane Andrew (2020-03)

Thesis (PhD)--Stellenbosch University, 2020.


ENGLISH ABSTRACT: According to the World Health Organisation, road accidents account for approximately 1.25 million deaths annually, making them the eighth leading cause of death worldwide. With the enormous losses to society resulting from road accidents, the prevention and severity reduction of road accidents has been an active area of research for many decades. Researchers frequently employ a variety of statistical learning techniques in an attempt to understand the factors contributing to higher levels of road accident risk. Such insights provide vital direction for governments with respect to safer road designs and the establishment of countermeasures aimed at reducing the number of road accidents. Furthermore, recent advances in machine learning have presented exciting new possibilities that were deemed far out of reach just over a decade ago. The tasks associated with data pre-processing in this context are, however, often daunting and immensely time-consuming. Moreover, the adoption of machine learning models in the road accident analysis literature has been relatively limited due to the uninterpretable nature of the majority of these models. A generic modular data mining framework is, therefore, proposed in this dissertation, aimed specifically at formalising and facilitating the tasks associated with road accident data preparation, and facilitating the interpretation of machine learning model output. This framework is designed to facilitate the configuration, enhancement and transformation of raw accident, vehicle, road and victim data into useful information which appropriately quantifies and characterises road accident risk. More specifically, this framework facilitates evaluation of road accident risk in terms of the rate and severity of being involved in a road accident along road segments and at road junctions, based on historically recorded road accidents.
The configuration procedure in the proposed framework allows a user to format data attributes appropriately, as well as to correct any missing or erroneous values that may exist in data sets. The enhancement procedure allows a user to merge vehicle and road records with the corresponding accident record for the purpose of creating an all-encompassing data set. It is also possible to construct new attributes based on current attribute values residing in the aforementioned data sets. After each of the individual data sets has been prepared appropriately and the data are deemed to be of a sufficiently high quality, they may be stored in a database. Finally, the transformation procedure exploits these high-quality data to quantify the rate and severity of road accidents along road segments or at road junctions. These results serve as input to a standard supervised learning procedure in which road characteristics are used to predict these rate and severity measurements. In order to demonstrate the practical workability and usefulness of the proposed framework, a concept demonstrator of the framework is implemented in an existing data mining platform and applied to a real-world case study based on road accident data from Greater Manchester in the United Kingdom. Each of the individual data preparation components of the framework is tested in the context of this case study, while the effectiveness of the road accident risk evaluation approaches is demonstrated by means of multiple investigations.
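The three procedures described above (configuration, enhancement and transformation) may be illustrated by a minimal sketch in Python. The sketch is not the implementation of the dissertation: all record layouts, attribute names, the numeric severity scale, the default-value imputation and the rate measure (accidents per kilometre per year) are assumptions made purely for illustration, since the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical raw records; all attribute names and values are illustrative.
accidents = [
    {"accident_id": 1, "segment_id": "A", "severity": 3},
    {"accident_id": 2, "segment_id": "A", "severity": None},  # missing value
    {"accident_id": 3, "segment_id": "B", "severity": 1},
]
vehicles = [
    {"accident_id": 1, "vehicle_type": "car"},
    {"accident_id": 1, "vehicle_type": "bus"},
    {"accident_id": 3, "vehicle_type": "car"},
]
segments = {"A": {"length_km": 2.0}, "B": {"length_km": 0.5}}
YEARS_OBSERVED = 4  # assumed length of the observation period


def configure(records, default_severity=1):
    """Configuration: replace missing severity values with an assumed default."""
    return [
        {**r, "severity": default_severity if r["severity"] is None else r["severity"]}
        for r in records
    ]


def enhance(accident_records, vehicle_records):
    """Enhancement: attach vehicle records to each accident and derive
    a new attribute (the number of vehicles involved)."""
    by_accident = defaultdict(list)
    for v in vehicle_records:
        by_accident[v["accident_id"]].append(v["vehicle_type"])
    return [
        {
            **a,
            "vehicle_types": by_accident[a["accident_id"]],
            "n_vehicles": len(by_accident[a["accident_id"]]),
        }
        for a in accident_records
    ]


def transform(accident_records, segment_info, years):
    """Transformation: per-segment accident rate (accidents per km per year,
    an assumed measure) and mean severity."""
    grouped = defaultdict(list)
    for a in accident_records:
        grouped[a["segment_id"]].append(a["severity"])
    return {
        seg: {
            "rate": len(sev) / (segment_info[seg]["length_km"] * years),
            "mean_severity": sum(sev) / len(sev),
        }
        for seg, sev in grouped.items()
    }


enhanced = enhance(configure(accidents), vehicles)
risk = transform(enhanced, segments, YEARS_OBSERVED)
```

The resulting `risk` dictionary holds one rate and one severity value per road segment. In a supervised learning setting such as the one described above, these values would play the role of target variables, with road characteristics as the predictors.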

AFRIKAANS SUMMARY (in English translation): According to the World Health Organisation, road accidents cause approximately 1.25 million deaths annually, making them the eighth leading cause of death worldwide. Given the enormous losses to society that road accidents bring about, the prevention and severity reduction of road accidents has been an active research area for decades. Researchers often use a variety of statistical learning techniques to understand the factors that contribute to higher levels of road accident risk. Such insights offer important opportunities for governments with respect to establishing safer road designs and countermeasures aimed at reducing the number of road accidents. Furthermore, recent progress in machine learning has made possible exciting new machine learning approaches that were considered out of reach just over a decade ago. The tasks associated with data preparation in this context are, however, often challenging and very time-consuming. The adoption of machine learning models in the literature on road accident analysis has also been relatively limited owing to the uninterpretable nature of the majority of these models. A generic modular data mining framework is therefore proposed in this dissertation, aimed specifically at formalising and facilitating the tasks associated with the preparation of road accident data, as well as facilitating the interpretation of the output of powerful machine learning algorithms. This framework is designed to facilitate the configuration, enhancement and transformation of raw accident, vehicle and road data into meaningful information that appropriately characterises and quantifies road accident risk. More specifically, this framework facilitates the evaluation of road accident risk in terms of the rate and severity of involvement in road accidents along road segments and at road junctions, based on historical road accidents.
The configuration procedure in the proposed framework enables the user to format data attributes appropriately, as well as to correct missing or erroneous values that may exist in data sets. The enhancement procedure enables the user to merge vehicle and road records into an overarching road accident record with the aim of creating an all-encompassing data set. The construction of new attributes based on current attribute values occurring in the aforementioned data sets is also possible. After each of the individual data sets has been prepared appropriately and the information is deemed to be of sufficiently high quality, the data may be stored in a database. Finally, the transformation procedure uses these high-quality data to determine the rate and severity of road accidents along road segments or at road junctions. These results serve as inputs to a standard supervised learning process in which road characteristics are used to predict these rate and severity measurements. In order to demonstrate the practical workability and usefulness of the proposed framework, a prototype of the framework is implemented in an existing data mining software environment and applied to a real-world case study based on road accident data from the Greater Manchester area of the United Kingdom. Each of the individual data preparation components of the framework is tested in the context of this case study, while the effectiveness of the road accident risk evaluation process is demonstrated by means of multiple investigations.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/107748