Combining tree kernels and text embeddings for plagiarism detection

dc.contributor.advisor: Van der Merwe, A. B. [en_ZA]
dc.contributor.advisor: Kroon, R. S. (Steve) [en_ZA]
dc.contributor.author: Thom, Jacobus Daniel [en_ZA]
dc.contributor.other: Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences (Computer Science) [en_ZA]
dc.date.accessioned: 2018-02-20T18:20:45Z
dc.date.accessioned: 2018-04-09T07:00:11Z
dc.date.available: 2018-02-20T18:20:45Z
dc.date.available: 2018-04-09T07:00:11Z
dc.date.issued: 2018-03
dc.description: Thesis (MSc)--Stellenbosch University, 2018. [en_ZA]
dc.description.abstract: ENGLISH ABSTRACT : The internet allows for vast amounts of information to be accessed with ease. Consequently, it becomes much easier to plagiarize any of this information as well. Most plagiarism detection techniques rely on n-grams to find similarities between suspicious documents and possible sources. N-grams, due to their simplicity, do not make full use of all the syntactic and semantic information contained in sentences. We therefore investigated two methods, namely tree kernels applied to the parse trees of sentences and text embeddings, to utilize more syntactic and semantic information respectively. A plagiarism detector was developed using these techniques and its effectiveness was tested on the PAN 2009 and 2011 external plagiarism corpora. The detector achieved results that were on par with the state of the art for both PAN 2009 and PAN 2011. This indicates that the combination of tree kernel and text embedding techniques is a viable method of plagiarism detection. [en_ZA]
dc.description.abstract: AFRIKAANSE OPSOMMING : Die internet laat mens toe om groot hoeveelhede inligting maklik in die hande te kry. Gevolglik word dit ook baie makliker om plagiaat op enige van hierdie inligting te pleeg. Meeste plagiaatopsporingstegnieke maak staat op n-gramme om ooreenkomste tussen verdagte dokumente en moontlike bronne op te spoor. Aangesien n-gramme taamlik eenvoudig is, maak hulle nie volle gebruik van al die sintaktiese en semantiese inligting wat sinne bevat nie. Ons ondersoek dus twee metodes, naamlik boomkernfunksies, wat toegepas word op die ontledingsbome van sinne, en teksinbeddings, om onderskeidelik meer sintaktiese en semantiese inligting te gebruik. 'n Plagiaatdetektor is ontwikkel met behulp van hierdie twee tegnieke en die effektiwiteit daarvan is getoets op die PAN 2009 en 2011 eksterne plagiaatkorpora. Die detektor het resultate behaal wat vergelykbaar was met die beste vir beide PAN 2009 en PAN 2011. Dit dui aan dat die kombinasie van boomkern- en teksinbeddingstegnieke 'n redelike metode van plagiaatopsporing is. [af_ZA]
dc.format.extent: xii, 73 pages : illustrations (some colour) [en_ZA]
dc.identifier.uri: http://hdl.handle.net/10019.1/103550
dc.language.iso: en_ZA [en_ZA]
dc.publisher: Stellenbosch : Stellenbosch University [en_ZA]
dc.rights.holder: Stellenbosch University [en_ZA]
dc.subject: Text embeddings [en_ZA]
dc.subject: Plagiarism -- Detection [en_ZA]
dc.subject: Tree kernels [en_ZA]
dc.subject: Syntactic structures [en_ZA]
dc.subject: Semantic structures [en_ZA]
dc.title: Combining tree kernels and text embeddings for plagiarism detection [en_ZA]
dc.type: Thesis [en_ZA]
Files
Original bundle
Name: thom_combining_2018.pdf
Size: 1.96 MB
Format: Adobe Portable Document Format