Natural Language Processing with machine learning for anomaly detection on system call logs
- Authors: Goosen, Christo
- Date: 2023-10-13
- Subjects: Natural language processing (Computer science) , Machine learning , Information security , Anomaly detection (Computer security) , Host-based intrusion detection system
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/424699 , vital:72176
- Description: Host intrusion detection systems and machine learning have been studied for many years especially on datasets like KDD99. Current research and systems are focused on low training and processing complex problems such as system call returns, which lack the system call arguments and potential traces of exploits run against a system. With respect to malware and vulnerabilities, signatures are relied upon, and the potential for natural language processing of the resulting logs and system call traces needs further experimentation. This research looks at unstructured raw system call traces from x86_64 bit GNU Linux operating systems with natural language processing and supervised and unsupervised machine learning techniques to identify current and unseen threats. The research explores whether these tools are within the skill set of information security professionals, or require data science professionals. The research makes use of an academic and modern system call dataset from Leipzig University and applies two machine learning models based on decision trees. Random Forest as the supervised algorithm is compared to the unsupervised Isolation Forest algorithm for this research, with each experiment repeated after hyper-parameter tuning. The research finds conclusive evidence that the Isolation Forest Tree algorithm is effective, when paired with a Principal Component Analysis, in identifying anomalies in the modern Leipzig Intrusion Detection Data Set (LID-DS) dataset combined with samples of executed malware from the Virus Total Academic dataset. The base or default model parameters produce sub-optimal results, whereas using a hyper-parameter tuning technique increases the accuracy to within promising levels for anomaly and potential zero day detection. , Thesis (MSc) -- Faculty of Science, Computer Science, 2023
- Full Text:
- Date Issued: 2023-10-13
- Authors: Goosen, Christo
- Date: 2023-10-13
- Subjects: Natural language processing (Computer science) , Machine learning , Information security , Anomaly detection (Computer security) , Host-based intrusion detection system
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/424699 , vital:72176
- Description: Host intrusion detection systems and machine learning have been studied for many years especially on datasets like KDD99. Current research and systems are focused on low training and processing complex problems such as system call returns, which lack the system call arguments and potential traces of exploits run against a system. With respect to malware and vulnerabilities, signatures are relied upon, and the potential for natural language processing of the resulting logs and system call traces needs further experimentation. This research looks at unstructured raw system call traces from x86_64 bit GNU Linux operating systems with natural language processing and supervised and unsupervised machine learning techniques to identify current and unseen threats. The research explores whether these tools are within the skill set of information security professionals, or require data science professionals. The research makes use of an academic and modern system call dataset from Leipzig University and applies two machine learning models based on decision trees. Random Forest as the supervised algorithm is compared to the unsupervised Isolation Forest algorithm for this research, with each experiment repeated after hyper-parameter tuning. The research finds conclusive evidence that the Isolation Forest Tree algorithm is effective, when paired with a Principal Component Analysis, in identifying anomalies in the modern Leipzig Intrusion Detection Data Set (LID-DS) dataset combined with samples of executed malware from the Virus Total Academic dataset. The base or default model parameters produce sub-optimal results, whereas using a hyper-parameter tuning technique increases the accuracy to within promising levels for anomaly and potential zero day detection. , Thesis (MSc) -- Faculty of Science, Computer Science, 2023
- Full Text:
- Date Issued: 2023-10-13
Simplified menu-driven data analysis tool with macro-like automation
- Authors: Kazembe, Luntha
- Date: 2022-10-14
- Subjects: Data analysis , Macro instructions (Electronic computers) , Quantitative research Software , Python (Computer program language) , Scripting languages (Computer science)
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/362905 , vital:65373
- Description: This study seeks to improve the data analysis process for individuals and small businesses with limited resources by developing a simplified data analysis software tool that allows users to carry out data analysis effectively and efficiently. Design considerations were identified to address limitations common in such environments, these included making the tool easy-to-use, requiring only a basic understanding of the data analysis process, designing the tool in manner that minimises computing resource requirements and user interaction and implementing it using Python which is open-source, effective and efficient in processing data. We develop a prototype simplified data analysis tool as a proof-of-concept. The tool has two components, namely, core elements which provide functionality for the data anal- ysis process including data collection, transformations, analysis and visualizations, and automation and performance enhancements to improve the data analysis process. The automation enhancements consist of the record and playback macro feature while the performance enhancements include multiprocessing and multi-threading abilities. The data analysis software was developed to analyse various alpha-numeric data formats by using a variety of statistical and mathematical techniques. The record and playback macro feature enhances the data analysis process by saving users time and computing resources when analysing large volumes of data or carrying out repetitive data analysis tasks. The feature has two components namely, the record component that is used to record data analysis steps and the playback component used to execute recorded steps. The simplified data analysis tool has parallelization designed and implemented which allows users to carry out two or more analysis tasks at a time, this improves productivity as users can do other tasks while the tool is processing data using recorded steps in the background. The tool was created and subsequently tested using common analysis scenarios applied to network data, log data and stock data. Results show that decision-making requirements such as accurate information, can be satisfied using this analysis tool. Based on the functionality implemented, similar analysis functionality to that provided by Microsoft Excel is available, but in a simplified manner. Moreover, a more sophisticated macro functionality is provided for the execution of repetitive tasks using the recording feature. Overall, the study found that the simplified data analysis tool is functional, usable, scalable, efficient and can carry out multiple analysis tasks simultaneously. , Thesis (MSc) -- Faculty of Science, Computer Science, 2022
- Full Text:
- Date Issued: 2022-10-14
- Authors: Kazembe, Luntha
- Date: 2022-10-14
- Subjects: Data analysis , Macro instructions (Electronic computers) , Quantitative research Software , Python (Computer program language) , Scripting languages (Computer science)
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/362905 , vital:65373
- Description: This study seeks to improve the data analysis process for individuals and small businesses with limited resources by developing a simplified data analysis software tool that allows users to carry out data analysis effectively and efficiently. Design considerations were identified to address limitations common in such environments, these included making the tool easy-to-use, requiring only a basic understanding of the data analysis process, designing the tool in manner that minimises computing resource requirements and user interaction and implementing it using Python which is open-source, effective and efficient in processing data. We develop a prototype simplified data analysis tool as a proof-of-concept. The tool has two components, namely, core elements which provide functionality for the data anal- ysis process including data collection, transformations, analysis and visualizations, and automation and performance enhancements to improve the data analysis process. The automation enhancements consist of the record and playback macro feature while the performance enhancements include multiprocessing and multi-threading abilities. The data analysis software was developed to analyse various alpha-numeric data formats by using a variety of statistical and mathematical techniques. The record and playback macro feature enhances the data analysis process by saving users time and computing resources when analysing large volumes of data or carrying out repetitive data analysis tasks. The feature has two components namely, the record component that is used to record data analysis steps and the playback component used to execute recorded steps. The simplified data analysis tool has parallelization designed and implemented which allows users to carry out two or more analysis tasks at a time, this improves productivity as users can do other tasks while the tool is processing data using recorded steps in the background. The tool was created and subsequently tested using common analysis scenarios applied to network data, log data and stock data. Results show that decision-making requirements such as accurate information, can be satisfied using this analysis tool. Based on the functionality implemented, similar analysis functionality to that provided by Microsoft Excel is available, but in a simplified manner. Moreover, a more sophisticated macro functionality is provided for the execution of repetitive tasks using the recording feature. Overall, the study found that the simplified data analysis tool is functional, usable, scalable, efficient and can carry out multiple analysis tasks simultaneously. , Thesis (MSc) -- Faculty of Science, Computer Science, 2022
- Full Text:
- Date Issued: 2022-10-14
An analysis of fusing advanced malware email protection logs, malware intelligence and active directory attributes as an instrument for threat intelligence
- Authors: Vermeulen, Japie
- Date: 2018
- Subjects: Malware (Computer software) , Computer networks Security measures , Data mining , Phishing , Data logging , Quantitative research
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/63922 , vital:28506
- Description: After more than four decades email is still the most widely used electronic communication medium today. This electronic communication medium has evolved into an electronic weapon of choice for cyber criminals ranging from the novice to the elite. As cyber criminals evolve with tools, tactics and procedures, so too are technology vendors coming forward with a variety of advanced malware protection systems. However, even if an organization adopts such a system, there is still the daily challenge of interpreting the log data and understanding the type of malicious email attack, including who the target was and what the payload was. This research examines a six month data set obtained from an advanced malware email protection system from a bank in South Africa. Extensive data fusion techniques are used to provide deeper insight into the data by blending these with malware intelligence and business context. The primary data set is fused with malware intelligence to identify the different malware families associated with the samples. Active Directory attributes such as the business cluster, department and job title of users targeted by malware are also fused into the combined data. This study provides insight into malware attacks experienced in the South African financial services sector. For example, most of the malware samples identified belonged to different types of ransomware families distributed by known botnets. However, indicators of targeted attacks were observed based on particular employees targeted with exploit code and specific strains of malware. Furthermore, a short time span between newly discovered vulnerabilities and the use of malicious code to exploit such vulnerabilities through email were observed in this study. The fused data set provided the context to answer the “who”, “what”, “where” and “when”. The proposed methodology can be applied to any organization to provide insight into the malware threats identified by advanced malware email protection systems. In addition, the fused data set provides threat intelligence that could be used to strengthen the cyber defences of an organization against cyber threats.
- Full Text:
- Date Issued: 2018
- Authors: Vermeulen, Japie
- Date: 2018
- Subjects: Malware (Computer software) , Computer networks Security measures , Data mining , Phishing , Data logging , Quantitative research
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/63922 , vital:28506
- Description: After more than four decades email is still the most widely used electronic communication medium today. This electronic communication medium has evolved into an electronic weapon of choice for cyber criminals ranging from the novice to the elite. As cyber criminals evolve with tools, tactics and procedures, so too are technology vendors coming forward with a variety of advanced malware protection systems. However, even if an organization adopts such a system, there is still the daily challenge of interpreting the log data and understanding the type of malicious email attack, including who the target was and what the payload was. This research examines a six month data set obtained from an advanced malware email protection system from a bank in South Africa. Extensive data fusion techniques are used to provide deeper insight into the data by blending these with malware intelligence and business context. The primary data set is fused with malware intelligence to identify the different malware families associated with the samples. Active Directory attributes such as the business cluster, department and job title of users targeted by malware are also fused into the combined data. This study provides insight into malware attacks experienced in the South African financial services sector. For example, most of the malware samples identified belonged to different types of ransomware families distributed by known botnets. However, indicators of targeted attacks were observed based on particular employees targeted with exploit code and specific strains of malware. Furthermore, a short time span between newly discovered vulnerabilities and the use of malicious code to exploit such vulnerabilities through email were observed in this study. The fused data set provided the context to answer the “who”, “what”, “where” and “when”. The proposed methodology can be applied to any organization to provide insight into the malware threats identified by advanced malware email protection systems. In addition, the fused data set provides threat intelligence that could be used to strengthen the cyber defences of an organization against cyber threats.
- Full Text:
- Date Issued: 2018
- «
- ‹
- 1
- ›
- »