An investigation into performance-related issues of regular expression matching

dc.contributor.advisorVan der Merwe, Brinken_ZA
dc.contributor.advisorBester, Willemen_ZA
dc.contributor.authorVan Litsenborgh, Pieter Steynen_ZA
dc.contributor.otherStellenbosch University. Faculty of Science. Dept. of Computer Science.en_ZA
dc.date.accessioned2022-11-24T03:24:11Zen_ZA
dc.date.accessioned2023-01-16T12:47:16Zen_ZA
dc.date.available2022-11-24T03:24:11Zen_ZA
dc.date.available2023-01-16T12:47:16Zen_ZA
dc.date.issued2022-11en_ZA
dc.descriptionThesis (MSc) -- Stellenbosch University, 2022.en_ZA
dc.description.abstractENGLISH ABSTRACT: Regular expressions (regexes) constitute a concise, powerful, and useful pattern matching language for strings. They are widely supported in programming languages, text processing programs, and (advanced) text editors. In general, regex matching is performed efficiently, but in some cases, a vulnerable regex (and an exploit string) can cause the matching procedure to take exponential time in the length of the input string. As regexes are frequently used in userfacing circumstances, vulnerable regexes open up the underlying application to a potential denial of service vector. Several regex engines choose to implement an algorithm that will always match an input string in linear time, but can only support a subset of modern regex constructs. We show how to implement a construct known as Regular Expressions with Lookahead (REwLA), and investigate various state complexity results when converting REwLA to equivalent deterministic finite automata (DFA). The relationship between REwLA with atomic operations and REwLA without is investigated, and an algorithm for translating a REwLA with atomic operations to a REwLA without is provided. Vulnerable regexes only exist when the regex engine implements a backtracking algorithm. We extend non-deterministic finite a utomata (NFA) and regexes by adding memoization to these formalisms. Furthermore, we generalise the concept of ambiguity in order to be applicable to memoized extensions. These extensions are aimed at improving the matching time of backtracking regex engines.en_ZA
dc.description.abstractAFRIKAANS OPSOMMING: Regulêre uitdrukkings (regekse) vorm ’n bondige, kragtige en bruikbare taal vir die patroon-ooreenstemming van stringe. Hulle word ondersteun in ’n groot hoeveelheid van programmeertale, teksverwerkingsprogramme en (gevorderde) teksredigeerders. Oor die algemeen word regeks-passing doeltreffend uitgevoer, maar in sommige gevalle kan ’n kwesbare regeks (en ’n uitbuit-string) veroorsaak dat die ooreenstemming prosedure eksponensiële tyd in die lengte van die invoerstring neem. Aangesien regekse gereeld gebruik word in omstandighede waar die gebruiker die hoof akteur is, maak dit kwesbare regekse oop tot ’n moontlike ontkenning van diens aanval. Verskeie regeks-enjins kies om ’n algoritme te implementeer wat altyd ’n invoerstring in lineêre tyd sal pas, maar kan slegs ’n deelversameling van moderne regeks-konstruksies ondersteun. Ons wys hoe om ’n konstruksie, bekend as “Regular Expressions with Lookahead” (REwLA) te implementeer en ondersoek verskeie toestandskompleksiteitsresultate wanneer REwLA na ekwivalente deterministiese eindige outomatiese (DFA) omgeskakel word. Die verband tussen REwLA met atomiese operatore en REwLA daarsonder word ondersoek en ’n algoritme vir die vertaling van REwLA met atomiese operatore na REwLA daarsonder word verskaf. Kwesbare regekse bestaan slegs wanneer die regex-enjin ’n terugspooralgoritme implementeer. Ons brei nie-deterministiese eindige outomatiese (NFA) en regekse uit deur memoisering by hierdie formalisme te voeg. Verder veralgemeen ons die konsep van dubbelsinnigheid om van toepassing te wees op gememoriseerde uitbreidings. Hierdie uitbreidings is daarop gemik om die passingstyd van terugsporing van regeks-enjins te verbeter.af_ZA
dc.description.versionMastersen_ZA
dc.format.extentvii, 96 pages : illustrationsen_ZA
dc.identifier.urihttp://hdl.handle.net/10019.1/126045en_ZA
dc.language.isoen_ZAen_ZA
dc.publisherStellenbosch : Stellenbosch Universityen_ZA
dc.rights.holderStellenbosch Universityen_ZA
dc.subjectRegular expressions (Computer science)en_ZA
dc.subjectPattern recognition systemsen_ZA
dc.subjectProgramming languages (Electronic computers)en_ZA
dc.subjectUCTDen_ZA
dc.titleAn investigation into performance-related issues of regular expression matchingen_ZA
dc.typeThesisen_ZA
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
vanlitsenborgh_investigation_2022.pdf
Size:
1.76 MB
Format:
Adobe Portable Document Format
Description: