An investigation of dead-zone pattern matching algorithms

Date
2016-02
Journal Title
Journal ISSN
Volume Title
Publisher
Stellenbosch : Stellenbosch University
Abstract
ENGLISH ABSTRACT: Pattern matching allows us to search some text for a word or for a sequence of characters|a popular feature of computer programs such as text editors. Traditionally, three distinct families of pattern matching algorithms exist: the Boyer-Moore (BM) algorithm, the Knuth-Morris-Pratt (KMP) algorithm, and the Rabin-Karp (RK) algorithm. The basic algorithm in all these algorithmic families was developed in the 1970s and 1980s. However a new family of pattern matching algorithms, known as the Dead-Zone (DZ) family of algorithms, has recently been developed. In a previous study, it was theoretically proven that DZ is able to pattern match a text with fewer match attempts than the well- known Horspool algorithm, a derivative of the BM algorithm. The main aim of this study was to provide empirical evidence to determine whether DZ is faster in practice. A benchmark platform was developed to com- pare variants of the DZ algorithm to existing pattern matching algorithms. Initial experiments were performed with four C implementations of the DZ algorithm (two recursive and two iterative implementations). Subsequent to this, DZ variants that make use of di erent shift functions as well as two parallel variants of DZ (implemented with Pthreads and CUDA) were devel- oped. Additionally, the underlying skeleton of the DZ algorithm was tweaked to determine whether the DZ code was optimal. The benchmark results showed that the C implementation of the iterative DZ variants performed favourably. Both iterative algorithms beat traditional pat- tern matching algorithms when searching natural language and genome texts, particularly for short patterns. When di erent shift functions were used, the only time a DZ implementation performed better than an implementation of the traditional algorithm was for a pattern length of 65536 characters. Con- trary to our expectations, the parallel implementation of DZ did not always provide a speedup. In fact, the Pthreaded variants of DZ were slower than the non-threaded DZ implementations, although the CUDA DZ variants were consistently ve times faster than a CPU implementation of Horspool. By us- ing a cache-friendly DZ algorithm, which reduces cache misses by about 20%, the the original DZ can be improved by approximately 5% for relatively short patterns (up to 128 characters with a natural language text). Moreover, a cost of recursion and the impact of information sharing were observed for all DZ variants and have thus been identi ed as intrinsic DZ characteristics. Further research is recommended to determine whether the cache-friendly DZ algorithm should become the standard implementation of the DZ algorithm. In addition, we hope that the development of our benchmark platform has produced a technique that can be used by researchers in future studies to conduct benchmark tests
AFRIKAANSE OPSOMMING: Patroonpassing word gebruik om vir 'n reeks opeenvolgende karakters in 'n blok van teks te soek. Dit word breedvoerig programmaties in rekenaarprogramme gebruik, byvoorbeeld in die teksredigeerders. Tradisioneel is daar drie afsonderlike patroonpassingalgoritme families: die Boyer-Moore (BM) familie, Knuth-Morris-Pratt (KMP) familie en Rabin-Karp (RK) familie. Die basisal- goritmes in hierdie algoritmefamilies was reeds in die 1970s en 1980s ontwikkel. Maar, 'n nuwe patroonpassingsalgoritme familie is egter onlangs ontwikkel. Dit staan as die Dooie Gebied (DG) algoritme familie bekend. 'n Vorige studie het bewys dat DG algoritmes in staat is om patroonpassing uit te voer met minder passingpogings as die welbekende Hoorspool algoritme, wat 'n afgeleide algoritme van die BM algoritme is. Die hoofdoel met hierdie studie was om die DG familie van algoritmes empiries te ondersoek. 'n Normtoets platform is ontwikkel om veranderlikes van die DG algoritme met bestaande patroonpassingsalgoritmes te vergelyk. Aanvanklike eksperimente is met vier C implementasies van die DG algoritme uitgevoer. Twee van die implementasies is rekursief en die ander twee is iteratief. Daarna was DG variante ontwikkel wat van verskillende skuif- funksies gebruik gemaak Twee parallelle variante van DG was ook ontwikkel. Een maak gebruik van \Pthreads' en die ander is in CUDA geimplementeer. Verder was die C kode weergawe van die basiese DG algoritme fyn aangepas om vas te stel of die kode optimaal was. Die normtoetsresultate dui aan dat die C-implementasie van die iteratiewe DG variante gunstig presteer bo-oor die tradisionele patroonpassingsalgoritmes. Beide van die iteratiewe algoritmes klop die tradisionele patroonpassingsalgoritmes wanneer daar met relatiewe kort patrone getoets word. Die verrigting van verskeie skuif-funksies was ook geondersoek. Die enigste keer wanneer die DG algoritmes beter presteer het as die tradisionele algoritme, was vir patroonlengtes van 65536 karakters. Teen ons verwagtinge, het die parallelle implementasie nie altyd spoedtoename voorsien nie. Tewens, die \Pthread" variante van DG was stadiger as die nie-gerygde DG implementasies. Die CUDA DG variante was egter telkens vyf keer vinniger as die konvensionele SVE implementasie van Horspool. Die normtoetse het ook aangedui dat die oorspronklike DG kode naby aan optimaal was. Egter, deur 'n kas-vriendelike weergawe te gebruik wat kas oorslane met omtrent 20% verminder, kon die prestasie met naastenby 5% verbeter word vir relatiewe kort patrone (tot by 128 karakters met natuurlike taal teks). Verder was daar vir al die DG variante n rekursiekoste en 'n impak op inligtingdeling waargeneem wat as interne DG kenmerke geidentifiseer is. Verdere navorsing word aanbeveel om vas te stel of die kas-vriendelike DG algoritme die standaard implementasie van die DG algoritme behoort te word. Bykomstiglik, hoop ons dat die ontwikkeling van ons normtoets platform 'n tegniek geproduseer het wat deur navorsers in toekomstige studies gebruik kan word om normtoetse uit te voer.
Description
Thesis (MA)--Stellenbosch, 2016.
Keywords
Pattern matching, Algorithms, Dead-zone, CUDA (Computer architecture), Parallel algorithms, UCTD
Citation