dc.contributor.advisor | Brink, Willie | en_ZA |
dc.contributor.advisor | Herbst, B. M. | en_ZA |
dc.contributor.author | Grond, Marco Marten | en_ZA |
dc.contributor.other | Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences. Applied Mathematics | en_ZA |
dc.date.accessioned | 2017-02-21T06:58:58Z | |
dc.date.accessioned | 2017-03-29T11:56:11Z | |
dc.date.available | 2017-02-21T06:58:58Z | |
dc.date.available | 2017-03-29T11:56:11Z | |
dc.date.issued | 2017-03 | |
dc.identifier.uri | http://hdl.handle.net/10019.1/100999 | |
dc.description | Thesis (MSc)--Stellenbosch University, 2017 | en_ZA |
dc.description.abstract | ENGLISH ABSTRACT : In this study we attempt to solve the problem of text detection in natural
images. This requires us to identify regions in a natural image that contain
text. Possible applications include assistive technology, human-computer
interaction and context extraction. Although humans find the task almost
trivial, large variations in colour, font, size and orientation must be accounted
for, and text shares many features and structures with other objects that cause
complications when attempting to automate a solution.
We train multiple convolutional neural networks in an attempt to solve
this problem. We chose convolutional neural networks both because they have
already displayed potential in the context of text recognition, and to better
understand how they operate. A sliding window approach is taken, where
smaller regions of a full image are classified separately before the results are
combined to identify text regions in the full image. Due to an insufficient
number of annotated natural training images, we create a supplementary synthetic
dataset. Using the synthetic data as a starting point, we train networks
of different structures, after which the same networks are fine-tuned on smaller
natural datasets.
Networks first trained on the synthetic data outperform networks trained
solely on the smaller natural datasets, regardless of structure complexity. This
is likely because a network cannot identify relevant features from a limited number
of training examples. Our experiments further show that a larger network
structure is required for generalization, and that smaller datasets are prone to
overfitting.
We apply our best performing trained network to the task of detecting text
in full images, by extracting and classifying regions in an image using a sliding
window. Image pyramids are also implemented to allow for greater variance
in the size of text that can be detected. We find, however, that implementing
image pyramids only slightly improves accuracy over a single-scale image, likely
because some scale variation was already present in the network’s
training set.
Ultimately, we find that convolutional neural networks show promise for
the task of text detection in natural images. We also find that training a
network on synthetic data and fine-tuning it on natural data improves the
overall accuracy. | en_ZA |
dc.description.abstract | AFRIKAANSE OPSOMMING : In hierdie studie poog ons om teks in natuurlike beelde op te spoor. Die
probleem vereis die identifisering van areas in ’n natuurlike beeld wat teks
bevat. Moontlike toepassings sluit in ondersteuningstegnologie, mens-rekenaar
interaksie en die onttrekking van konteks. Alhoewel ’n mens die taak baie
maklik mag vind, moet variasies in kleur, lettertipe, grootte en oriëntasie in
ag geneem word. Teks deel ook sekere kenmerke met ander beeldstrukture,
wat die outomatisering van ’n oplossing verder kompliseer.
Ons poog om die probleem op te los deur verskeie konvolusie-netwerke
vir die taak af te rig. Ons het besluit op hierdie soort neurale netwerke,
aangesien hulle alreeds potensiaal in die konteks van teksherkenning getoon
het, en ook om ’n beter begrip te ontwikkel oor hoe hulle werk. Ons onttrek
kleiner vensters uit die beeld, klassifiseer elkeen afsonderlik, en kombineer dan
die klassifikasies om areas van teks in die volle beeld te identifiseer. Vanweë ’n
tekort aan geannoteerde data skep ons ’n aanvullende datastel van sintetiese
beelde. Deur die sintetiese beelde as beginpunt te gebruik, rig ons verskeie
netwerke met verskillende strukture af, waarna ons die netwerke met behulp
van natuurlike data verfyn.
Netwerke wat eers op sintetiese data afgerig is vaar beter as dié wat slegs
op natuurlike data afgerig is, ongeag netwerkstruktuur. Dit is moontlik te
danke aan die feit dat ’n netwerk nie relevante kenmerke van teks uit min
data kan identifiseer nie. Dit blyk verder uit ons eksperimente dat groter
netwerkstrukture nodig is vir beter veralgemening, en dat kleiner datastelle
oormatige passing tot gevolg kan hê.
Ons gebruik die beste afgerigte netwerk om teks in volle beelde op te spoor,
deur vensters uit ’n beeld te onttrek en hulle te klassifiseer. Beeld-piramides
word verder gebruik om die netwerke toe te laat om ’n groter variasie in die
grootte van teks te kan identifiseer. Die gebruik van beeld-piramides het egter
’n klein impak op akkuraatheid, waarskynlik te danke aan die feit dat die
netwerke reeds afgerig was op teks van verskeie groottes.
Deur die loop van hierdie studie het ons tot die gevolgtrekking gekom dat
konvolusie-netwerke geskik kan wees om teks in natuurlike beelde op te spoor.
Ons het ook gevind dat afrigting op sintetiese data en verfyning op natuurlike
data die akkuraatheid van ’n netwerk kan verbeter. | af_ZA |
dc.format.extent | v, 77 pages ; colour illustrations | en_ZA |
dc.language.iso | en_ZA | en_ZA |
dc.publisher | Stellenbosch : Stellenbosch University | en_ZA |
dc.subject | Text detection | en_ZA |
dc.subject | Convolutional neural networks | en_ZA |
dc.subject | Computer vision | en_ZA |
dc.subject | Machine learning | en_ZA |
dc.subject | UCTD | en_ZA |
dc.title | Text detection in natural images using convolutional neural networks | en_ZA |
dc.type | Thesis | en_ZA |
dc.rights.holder | Stellenbosch University | en_ZA |