Reinforcement learning : theory, methods and application to decision support systems

Mouton, Hildegarde Suzanne (2010-12)

Thesis (MSc (Applied Mathematics))--University of Stellenbosch, 2010.

Thesis

ENGLISH ABSTRACT: In this dissertation we study the machine learning subfield of Reinforcement Learning (RL). After developing a coherent background, we apply a Monte Carlo (MC) control algorithm with exploring starts (MCES), as well as an off-policy Temporal-Difference (TD) learning control algorithm, Q-learning, to a simplified version of the Weapon Assignment (WA) problem. For the MCES control algorithm, a discount parameter of τ = 1 is used. This gives very promising results when applied to 7 × 7 grids, as well as 71 × 71 grids. The same discount parameter cannot be applied to the Q-learning algorithm, as it causes the Q-values to diverge. We take a greedy approach, setting ε = 0, and vary the learning rate (α ) and the discount parameter (τ). Experimentation shows that the best results are found with set to 0.1 and constrained in the region 0.4 ≤ τ ≤ 0.7. The MC control algorithm with exploring starts gives promising results when applied to the WA problem. It performs significantly better than the off-policy TD algorithm, Q-learning, even though it is almost twice as slow. The modern battlefield is a fast paced, information rich environment, where discovery of intent, situation awareness and the rapid evolution of concepts of operation and doctrine are critical success factors. Combining the techniques investigated and tested in this work with other techniques in Artificial Intelligence (AI) and modern computational techniques may hold the key to solving some of the problems we now face in warfare.

AFRIKAANSE OPSOMMING: Die fokus van hierdie verhandeling is die masjienleer-algoritmes in die veld van versterkingsleer. ’n Koherente agtergrond van die veld word gevolg deur die toepassing van ’n Monte Carlo (MC) beheer-algoritme met ondersoekende begintoestande, sowel as ’n afbeleid Temporale-Verskil beheer-algoritme, Q-leer, op ’n vereenvoudigde weergawe van die wapentoekenningsprobleem. Vir die MC beheer-algoritme word ’n afslagparameter van τ = 1 gebruik. Dit lewer belowende resultate wanneer toegepas op 7 × 7 roosters, asook op 71 × 71 roosters. Dieselfde afslagparameter kan nie op die Q-leer algoritme toegepas word nie, aangesien dit veroorsaak dat die Q-waardes divergeer. Ons neem ’n gulsige aanslag deur die gulsigheidsparameter te verstel na ε = 0. Ons varieer dan die leertempo ( α) en die afslagparameter (τ). Die beste eksperimentele resultate is behaal wanneer = 0.1 en as die afslagparameter vasgehou word in die gebied 0.4 ≤ τ ≤ 0.7. Die MC beheer-algoritme lewer belowende resultate wanneer toegepas op die wapentoekenningsprobleem. Dit lewer beduidend beter resultate as die Q-leer algoritme, al neem dit omtrent twee keer so lank om uit te voer. Die moderne slagveld is ’n omgewing ryk aan inligting, waar dit kritiek belangrik is om vinnig die vyand se planne te verstaan, om bedag te wees op die omgewing en die konteks van gebeure, en waar die snelle ontwikkeling van die konsepte van operasie en doktrine lei tot sukses. Die tegniekes wat in die verhandeling ondersoek en getoets is, en ander kunsmatige intelligensie tegnieke en moderne berekeningstegnieke saamgesnoer, mag dalk die sleutel hou tot die oplossing van die probleme wat ons tans in die gesig staar in oorlogvoering.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/5304
This item appears in the following collections: