A framework for high speed lexical classification of malicious URLs

Egan, Shaun Peter

Title: A framework for high speed lexical classification of malicious URLs
Creator: Egan, Shaun Peter
ThesisAdvisor: Irwin, Barry
Subject: Internet -- Security measures -- Research
Subject: Uniform Resource Identifiers -- Security measures -- Research
Subject: Neural networks (Computer science) -- Research
Subject: Computer security -- Research
Subject: Computer crimes -- Prevention
Subject: Phishing
Date: 2014
Type: Thesis
Type: Masters
Type: MSc
Identifier: vital:4696
Identifier: http://hdl.handle.net/10962/d1011933
Description: Phishing attacks employ social engineering to target end-users, with the goal of stealing identifying or sensitive information. This information is used in activities such as identity theft or financial fraud. During a phishing campaign, attackers distribute URLs which, along with false information, point to fraudulent resources in an attempt to deceive users into requesting the resource. These URLs are obscured using several techniques that make automated detection difficult. Current methods used to detect malicious URLs face multiple problems which attackers use to their advantage. These problems include the time required to react to new attacks, shifts in trends in URL obfuscation, and usability problems caused by the latency incurred by the lookups these approaches require. A new method of identifying malicious URLs using Artificial Neural Networks (ANNs) has been shown to be effective by several authors. The simple method of classification performed by ANNs results in very high classification speeds with little impact on usability. Samples used for the training, validation and testing of these ANNs are gathered from Phishtank and Open Directory. Words selected from the different sections of the samples are used to create a 'Bag-of-Words' (BOW), which is used as a binary input vector indicating the presence of a word for a given sample. Twenty additional features which measure lexical attributes of the sample are used to increase classification accuracy. A framework that is capable of generating these classifiers in an automated fashion is implemented. These classifiers are automatically stored on a remote update distribution service which has been built to supply updates to classifier implementations. An example browser plugin is created and uses ANNs provided by this service. It is capable both of classifying URLs requested by a user in real time and of blocking these requests.
The framework is tested in terms of training time and classification accuracy. Classification speed and the effectiveness of compression algorithms on the data required to distribute updates are also tested. It is concluded that it is possible to generate these ANNs frequently, and in a form that is small enough to distribute easily. It is also shown that classifications are made at high speed with high accuracy, resulting in little impact on usability.
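
The input encoding described above (a binary Bag-of-Words vector plus numeric lexical features) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the vocabulary, the tokenisation rule, and the three lexical features shown are assumptions chosen for illustration, the thesis uses twenty lexical features that are not listed in this record, and the sketch ignores which section of the URL a word comes from.

```python
import re
from urllib.parse import urlsplit

# Hypothetical vocabulary; the thesis selects its words from training samples.
VOCABULARY = ["login", "secure", "paypal", "account", "update"]


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenisation)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def bow_vector(url, vocabulary=VOCABULARY):
    """Binary vector: 1 if the vocabulary word occurs as a token in the URL."""
    tokens = set(tokenize(url))
    return [1 if word in tokens else 0 for word in vocabulary]


def lexical_features(url):
    """Three illustrative lexical measurements (not the thesis's twenty)."""
    host = urlsplit(url).hostname or ""
    return [
        len(url),                        # total URL length
        host.count("."),                 # number of dots in the host name
        sum(c.isdigit() for c in url),   # number of digit characters
    ]


def input_vector(url):
    """Concatenate the BOW bits and lexical features into one input vector."""
    return bow_vector(url) + lexical_features(url)


if __name__ == "__main__":
    print(input_vector("http://secure-login.example.com/account"))
```

A vector of this shape would then be fed to a classifier such as an ANN; in practice the numeric features would likely need scaling before training.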
Format: 147 p., pdf
Publisher: Rhodes University, Faculty of Science, Computer Science
Language: English
Rights: Egan, Shaun Peter

Collections

RU Department of Computer Science

File: SOURCEPDF, PDF, 1 MB, Adobe Acrobat PDF