- Title
- A framework for high speed lexical classification of malicious URLs
- Creator
- Egan, Shaun Peter
- ThesisAdvisor
- Irwin, Barry
- Subject
- Internet -- Security measures -- Research
- Subject
- Uniform Resource Identifiers -- Security measures -- Research
- Subject
- Neural networks (Computer science) -- Research
- Subject
- Computer security -- Research
- Subject
- Computer crimes -- Prevention
- Subject
- Phishing
- Date
- 2014
- Type
- Thesis
- Type
- Masters
- Type
- MSc
- Identifier
- vital:4696
- Identifier
- http://hdl.handle.net/10962/d1011933
- Description
- Phishing attacks employ social engineering to target end-users, with the goal of stealing identifying or sensitive information. This information is used in activities such as identity theft or financial fraud. During a phishing campaign, attackers distribute URLs which, along with false information, point to fraudulent resources in an attempt to deceive users into requesting the resource. These URLs are obscured using several techniques that make automated detection difficult. Current methods used to detect malicious URLs face multiple problems which attackers use to their advantage. These problems include the time required to react to new attacks, shifting trends in URL obfuscation, and usability problems caused by the latency of the lookups these approaches require. A new method of identifying malicious URLs using Artificial Neural Networks (ANNs) has been shown to be effective by several authors. The simple method of classification performed by ANNs results in very high classification speeds with little impact on usability. Samples used for the training, validation and testing of these ANNs are gathered from Phishtank and Open Directory. Words selected from the different sections of the samples are used to create a "Bag-of-Words" (BOW), which is used as a binary input vector indicating the presence of a word for a given sample. Twenty additional features which measure lexical attributes of the sample are used to increase classification accuracy. A framework that is capable of generating these classifiers in an automated fashion is implemented. These classifiers are automatically stored on a remote update distribution service which has been built to supply updates to classifier implementations. An example browser plugin is created and uses ANNs provided by this service. It is capable both of classifying URLs requested by a user in real time and of blocking these requests.
The framework is tested in terms of training time and classification accuracy. Classification speed and the effectiveness of compression algorithms on the data required to distribute updates are also tested. It is concluded that it is possible to generate these ANNs frequently, and in a form that is small enough to distribute easily. It is also shown that classifications are made at high speed with high accuracy, resulting in little impact on usability.
- Format
- 147 p., pdf
- Publisher
- Rhodes University, Faculty of Science, Computer Science
- Language
- English
- Rights
- Egan, Shaun Peter
- Hits: 2422
- Visitors: 2581
- Downloads: 219
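
The input encoding summarised in the Description (a binary Bag-of-Words vector extended with lexical measurements) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `VOCABULARY` list, the tokenisation rule, and the three lexical features are hypothetical stand-ins, not the word list or the twenty features used in the thesis.

```python
import re
from urllib.parse import urlsplit

# Hypothetical vocabulary; the thesis derives its words from training samples.
VOCABULARY = ["login", "secure", "account", "paypal", "update"]


def tokenize(url):
    """Split a URL into lowercase alphanumeric words (assumed tokenisation)."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]


def bag_of_words(url, vocabulary=VOCABULARY):
    """Binary vector: 1 where a vocabulary word occurs as a token of the URL."""
    tokens = set(tokenize(url))
    return [1 if word in tokens else 0 for word in vocabulary]


def lexical_features(url):
    """Three illustrative lexical measurements (not the thesis's twenty)."""
    host = urlsplit(url).hostname or ""
    return [
        len(url),                       # total URL length in characters
        host.count("."),                # number of dots in the host name
        sum(c.isdigit() for c in url),  # number of digit characters
    ]


def input_vector(url):
    """Concatenate the binary word vector and the lexical features."""
    return bag_of_words(url) + lexical_features(url)


if __name__ == "__main__":
    print(input_vector("http://paypal.secure-login.example.com/update/account1"))
```

A vector of this shape would be the input to a classifier such as an ANN. In practice the unbounded features (for example URL length) would normally be scaled before training, a step this sketch omits.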