Combining tree kernels and text embeddings for plagiarism detection

Thom, Jacobus Daniël (2018-03)

Thesis (MSc)--Stellenbosch University, 2018.

Thesis

ENGLISH ABSTRACT : The internet makes vast amounts of information easy to access and, consequently, easy to plagiarize. Most plagiarism detection techniques rely on n-grams to find similarities between suspicious documents and possible sources. Because n-grams are so simple, they do not make full use of the syntactic and semantic information contained in sentences. We therefore investigated two methods that use more of this information: tree kernels applied to the parse trees of sentences, which capture syntactic information, and text embeddings, which capture semantic information. A plagiarism detector was developed using these techniques, and its effectiveness was tested on the PAN 2009 and PAN 2011 external plagiarism corpora. The detector achieved results on par with the state of the art for both corpora, which indicates that combining tree kernels and text embeddings is a viable method of plagiarism detection.
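
To make the n-gram baseline mentioned in the abstract concrete, the following is a minimal sketch of word n-gram overlap between a suspicious passage and a candidate source. It is not the detector developed in the thesis; the function names `word_ngrams` and `ngram_jaccard`, the whitespace tokenisation, the default of n = 3, and the use of Jaccard similarity are all assumptions chosen for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in `text`.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(suspicious, source, n=3):
    """Jaccard similarity of the two texts' word n-gram sets, in [0, 1].

    Returns 0.0 when neither text is long enough to contain an n-gram.
    """
    a = word_ngrams(suspicious, n)
    b = word_ngrams(source, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    reworded = "the feline sat on the rug"
    print(ngram_jaccard(original, original))
    print(ngram_jaccard(original, reworded))
```

A verbatim copy scores 1.0, while replacing just two words with synonyms leaves only one shared trigram, so the score drops sharply even though the sentence structure and meaning are unchanged. This is the kind of gap that syntactic and semantic methods aim to close.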

AFRIKAANSE OPSOMMING : Die internet laat mens toe om groot hoeveelhede inligting maklik in die hande te kry. Gevolglik word dit ook baie makliker om plagiaat op enige van hierdie inligting te pleeg. Meeste plagiaatopsporingstegnieke maak staat op n-gramme om ooreenkomste tussen verdagte dokumente en moontlike bronne op te spoor. Aangesien n-gramme taamlik eenvoudig is, maak hulle nie volle gebruik van al die sintaktiese en semantiese inligting wat sinne bevat nie. Ons ondersoek dus twee metodes, naamlik boomkernfunksies, wat toegepas word op die ontledingsbome van sinne, en teksinbeddings, om onderskeidelik meer sintaktiese en semantiese inligting te gebruik. 'n Plagiaatdetektor is ontwikkel met behulp van hierdie twee tegnieke en die effektiwiteit daarvan is getoets op die PAN 2009 en 2011 eksterne plagiaatkorpora. Die detektor het resultate behaal wat vergelykbaar was met die beste vir beide PAN 2009 en PAN 2011. Dit dui aan dat die kombinasie van boomkern- en teksinbeddingstegnieke 'n redelike metode van plagiaatopsporing is.

Please refer to this item in SUNScholar by using the following persistent URL: http://hdl.handle.net/10019.1/103550