Application of machine learning, molecular modelling and structural data mining against antiretroviral drug resistance in HIV-1
- Sheik Amamuddy, Olivier Serge André
- Authors: Sheik Amamuddy, Olivier Serge André
- Date: 2020
- Subjects: Machine learning , Molecules -- Models , Data mining , Neural networks (Computer science) , Antiretroviral agents , Protease inhibitors , Drug resistance , Multidrug resistance , Molecular dynamics , Renin-angiotensin system , HIV (Viruses) -- South Africa , HIV (Viruses) -- Social aspects -- South Africa , South African Natural Compounds Database
- Language: English
- Type: text , Thesis , Doctoral , PhD
- Identifier: http://hdl.handle.net/10962/115964 , vital:34282
- Description: Millions are affected with the Human Immunodeficiency Virus (HIV) world wide, even though the death toll is on the decline. Antiretrovirals (ARVs), more specifically protease inhibitors have shown tremendous success since their introduction into therapy since the mid 1990’s by slowing down progression to the Acquired Immune Deficiency Syndrome (AIDS). However, Drug Resistance Mutations (DRMs) are constantly selected for due to viral adaptation, making drugs less effective over time. The current challenge is to manage the infection optimally with a limited set of drugs, with differing associated levels of toxicities in the face of a virus that (1) exists as a quasispecies, (2) may transmit acquired DRMs to drug-naive individuals and (3) that can manifest class-wide resistance due to similarities in design. The presence of latent reservoirs, unawareness of infection status, education and various socio-economic factors make the problem even more complex. Adequate timing and choice of drug prescription together with treatment adherence are very important as drug toxicities, drug failure and sub-optimal treatment regimens leave room for further development of drug resistance. While CD4 cell count and the determination of viral load from patients in resource-limited settings are very helpful to track how well a patient’s immune system is able to keep the virus in check, they can be lengthy in determining whether an ARV is effective. Phenosense assay kits answer this problem using viruses engineered to contain the patient sequences and evaluating their growth in the presence of different ARVs, but this can be expensive and too involved for routine checks. As a cheaper and faster alternative, genotypic assays provide similar information from HIV pol sequences obtained from blood samples, inferring ARV efficacy on the basis of drug resistance mutation patterns. However, these are inherently complex and the various methods of in silico prediction, such as Geno2pheno, REGA and Stanford HIVdb do not always agree in every case, even though this gap decreases as the list of resistance mutations is updated. A major gap in HIV treatment is that the information used for predicting drug resistance is mainly computed from data containing an overwhelming majority of B subtype HIV, when these only comprise about 12% of the worldwide HIV infections. In addition to growing evidence that drug resistance is subtype-related, it is intuitive to hypothesize that as subtyping is a phylogenetic classification, the more divergent a subtype is from the strains used in training prediction models, the less their resistance profiles would correlate. For the aforementioned reasons, we used a multi-faceted approach to attack the virus in multiple ways. This research aimed to (1) improve resistance prediction methods by focusing solely on the available subtype, (2) mine structural information pertaining to resistance in order to find any exploitable weak points and increase knowledge of the mechanistic processes of drug resistance in HIV protease. Finally, (3) we screen for protease inhibitors amongst a database of natural compounds [the South African natural compound database (SANCDB)] to find molecules or molecular properties usable to come up with improved inhibition against the drug target. In this work, structural information was mined using the Anisotropic Network Model, Dynamics Cross-Correlation, Perturbation Response Scanning, residue contact network analysis and the radius of gyration. These methods failed to give any resistance-associated patterns in terms of natural movement, internal correlated motions, residue perturbation response, relational behaviour and global compaction respectively. Applications of drug docking, homology-modelling and energy minimization for generating features suitable for machine-learning were not very promising, and rather suggest that the value of binding energies by themselves from Vina may not be very reliable quantitatively. All these failures lead to a refinement that resulted in a highly sensitive statistically-guided network construction and analysis, which leads to key findings in the early dynamics associated with resistance across all PI drugs. The latter experiment unravelled a conserved lateral expansion motion occurring at the flap elbows, and an associated contraction that drives the base of the dimerization domain towards the catalytic site’s floor in the case of drug resistance. Interestingly, we found that despite the conserved movement, bond angles were degenerate. Alongside, 16 Artificial Neural Network models were optimised for HIV proteases and reverse transcriptase inhibitors, with performances on par with Stanford HIVdb. Finally, we prioritised 9 compounds with potential protease inhibitory activity using virtual screening and molecular dynamics (MD) to additionally suggest a promising modification to one of the compounds. This yielded another molecule inhibiting equally well both opened and closed receptor target conformations, whereby each of the compounds had been selected against an array of multi-drug-resistant receptor variants. While a main hurdle was a lack of non-B subtype data, our findings, especially from the statistically-guided network analysis, may extrapolate to a certain extent to them as the level of conservation was very high within subtype B, despite all the present variations. This network construction method lays down a sensitive approach for analysing a pair of alternate phenotypes for which complex patterns prevail, given a sufficient number of experimental units. During the course of research a weighted contact mapping tool was developed to compare renin-angiotensinogen variants and packaged as part of the MD-TASK tool suite. Finally the functionality, compatibility and performance of the MODE-TASK tool were evaluated and confirmed for both Python2.7.x and Python3.x, for the analysis of normals modes from single protein structures and essential modes from MD trajectories. These techniques and tools collectively add onto the conventional means of MD analysis.
- Full Text:
- Authors: Sheik Amamuddy, Olivier Serge André
- Date: 2020
- Subjects: Machine learning , Molecules -- Models , Data mining , Neural networks (Computer science) , Antiretroviral agents , Protease inhibitors , Drug resistance , Multidrug resistance , Molecular dynamics , Renin-angiotensin system , HIV (Viruses) -- South Africa , HIV (Viruses) -- Social aspects -- South Africa , South African Natural Compounds Database
- Language: English
- Type: text , Thesis , Doctoral , PhD
- Identifier: http://hdl.handle.net/10962/115964 , vital:34282
- Description: Millions are affected with the Human Immunodeficiency Virus (HIV) world wide, even though the death toll is on the decline. Antiretrovirals (ARVs), more specifically protease inhibitors have shown tremendous success since their introduction into therapy since the mid 1990’s by slowing down progression to the Acquired Immune Deficiency Syndrome (AIDS). However, Drug Resistance Mutations (DRMs) are constantly selected for due to viral adaptation, making drugs less effective over time. The current challenge is to manage the infection optimally with a limited set of drugs, with differing associated levels of toxicities in the face of a virus that (1) exists as a quasispecies, (2) may transmit acquired DRMs to drug-naive individuals and (3) that can manifest class-wide resistance due to similarities in design. The presence of latent reservoirs, unawareness of infection status, education and various socio-economic factors make the problem even more complex. Adequate timing and choice of drug prescription together with treatment adherence are very important as drug toxicities, drug failure and sub-optimal treatment regimens leave room for further development of drug resistance. While CD4 cell count and the determination of viral load from patients in resource-limited settings are very helpful to track how well a patient’s immune system is able to keep the virus in check, they can be lengthy in determining whether an ARV is effective. Phenosense assay kits answer this problem using viruses engineered to contain the patient sequences and evaluating their growth in the presence of different ARVs, but this can be expensive and too involved for routine checks. As a cheaper and faster alternative, genotypic assays provide similar information from HIV pol sequences obtained from blood samples, inferring ARV efficacy on the basis of drug resistance mutation patterns. However, these are inherently complex and the various methods of in silico prediction, such as Geno2pheno, REGA and Stanford HIVdb do not always agree in every case, even though this gap decreases as the list of resistance mutations is updated. A major gap in HIV treatment is that the information used for predicting drug resistance is mainly computed from data containing an overwhelming majority of B subtype HIV, when these only comprise about 12% of the worldwide HIV infections. In addition to growing evidence that drug resistance is subtype-related, it is intuitive to hypothesize that as subtyping is a phylogenetic classification, the more divergent a subtype is from the strains used in training prediction models, the less their resistance profiles would correlate. For the aforementioned reasons, we used a multi-faceted approach to attack the virus in multiple ways. This research aimed to (1) improve resistance prediction methods by focusing solely on the available subtype, (2) mine structural information pertaining to resistance in order to find any exploitable weak points and increase knowledge of the mechanistic processes of drug resistance in HIV protease. Finally, (3) we screen for protease inhibitors amongst a database of natural compounds [the South African natural compound database (SANCDB)] to find molecules or molecular properties usable to come up with improved inhibition against the drug target. In this work, structural information was mined using the Anisotropic Network Model, Dynamics Cross-Correlation, Perturbation Response Scanning, residue contact network analysis and the radius of gyration. These methods failed to give any resistance-associated patterns in terms of natural movement, internal correlated motions, residue perturbation response, relational behaviour and global compaction respectively. Applications of drug docking, homology-modelling and energy minimization for generating features suitable for machine-learning were not very promising, and rather suggest that the value of binding energies by themselves from Vina may not be very reliable quantitatively. All these failures lead to a refinement that resulted in a highly sensitive statistically-guided network construction and analysis, which leads to key findings in the early dynamics associated with resistance across all PI drugs. The latter experiment unravelled a conserved lateral expansion motion occurring at the flap elbows, and an associated contraction that drives the base of the dimerization domain towards the catalytic site’s floor in the case of drug resistance. Interestingly, we found that despite the conserved movement, bond angles were degenerate. Alongside, 16 Artificial Neural Network models were optimised for HIV proteases and reverse transcriptase inhibitors, with performances on par with Stanford HIVdb. Finally, we prioritised 9 compounds with potential protease inhibitory activity using virtual screening and molecular dynamics (MD) to additionally suggest a promising modification to one of the compounds. This yielded another molecule inhibiting equally well both opened and closed receptor target conformations, whereby each of the compounds had been selected against an array of multi-drug-resistant receptor variants. While a main hurdle was a lack of non-B subtype data, our findings, especially from the statistically-guided network analysis, may extrapolate to a certain extent to them as the level of conservation was very high within subtype B, despite all the present variations. This network construction method lays down a sensitive approach for analysing a pair of alternate phenotypes for which complex patterns prevail, given a sufficient number of experimental units. During the course of research a weighted contact mapping tool was developed to compare renin-angiotensinogen variants and packaged as part of the MD-TASK tool suite. Finally the functionality, compatibility and performance of the MODE-TASK tool were evaluated and confirmed for both Python2.7.x and Python3.x, for the analysis of normals modes from single protein structures and essential modes from MD trajectories. These techniques and tools collectively add onto the conventional means of MD analysis.
- Full Text:
An analysis of fusing advanced malware email protection logs, malware intelligence and active directory attributes as an instrument for threat intelligence
- Authors: Vermeulen, Japie
- Date: 2018
- Subjects: Malware (Computer software) , Computer networks Security measures , Data mining , Phishing , Data logging , Quantitative research
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/63922 , vital:28506
- Description: After more than four decades email is still the most widely used electronic communication medium today. This electronic communication medium has evolved into an electronic weapon of choice for cyber criminals ranging from the novice to the elite. As cyber criminals evolve with tools, tactics and procedures, so too are technology vendors coming forward with a variety of advanced malware protection systems. However, even if an organization adopts such a system, there is still the daily challenge of interpreting the log data and understanding the type of malicious email attack, including who the target was and what the payload was. This research examines a six month data set obtained from an advanced malware email protection system from a bank in South Africa. Extensive data fusion techniques are used to provide deeper insight into the data by blending these with malware intelligence and business context. The primary data set is fused with malware intelligence to identify the different malware families associated with the samples. Active Directory attributes such as the business cluster, department and job title of users targeted by malware are also fused into the combined data. This study provides insight into malware attacks experienced in the South African financial services sector. For example, most of the malware samples identified belonged to different types of ransomware families distributed by known botnets. However, indicators of targeted attacks were observed based on particular employees targeted with exploit code and specific strains of malware. Furthermore, a short time span between newly discovered vulnerabilities and the use of malicious code to exploit such vulnerabilities through email were observed in this study. The fused data set provided the context to answer the “who”, “what”, “where” and “when”. The proposed methodology can be applied to any organization to provide insight into the malware threats identified by advanced malware email protection systems. In addition, the fused data set provides threat intelligence that could be used to strengthen the cyber defences of an organization against cyber threats.
- Full Text:
- Authors: Vermeulen, Japie
- Date: 2018
- Subjects: Malware (Computer software) , Computer networks Security measures , Data mining , Phishing , Data logging , Quantitative research
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/63922 , vital:28506
- Description: After more than four decades email is still the most widely used electronic communication medium today. This electronic communication medium has evolved into an electronic weapon of choice for cyber criminals ranging from the novice to the elite. As cyber criminals evolve with tools, tactics and procedures, so too are technology vendors coming forward with a variety of advanced malware protection systems. However, even if an organization adopts such a system, there is still the daily challenge of interpreting the log data and understanding the type of malicious email attack, including who the target was and what the payload was. This research examines a six month data set obtained from an advanced malware email protection system from a bank in South Africa. Extensive data fusion techniques are used to provide deeper insight into the data by blending these with malware intelligence and business context. The primary data set is fused with malware intelligence to identify the different malware families associated with the samples. Active Directory attributes such as the business cluster, department and job title of users targeted by malware are also fused into the combined data. This study provides insight into malware attacks experienced in the South African financial services sector. For example, most of the malware samples identified belonged to different types of ransomware families distributed by known botnets. However, indicators of targeted attacks were observed based on particular employees targeted with exploit code and specific strains of malware. Furthermore, a short time span between newly discovered vulnerabilities and the use of malicious code to exploit such vulnerabilities through email were observed in this study. The fused data set provided the context to answer the “who”, “what”, “where” and “when”. The proposed methodology can be applied to any organization to provide insight into the malware threats identified by advanced malware email protection systems. In addition, the fused data set provides threat intelligence that could be used to strengthen the cyber defences of an organization against cyber threats.
- Full Text:
Log analysis aided by latent semantic mapping
- Authors: Buys, Stephanus
- Date: 2013 , 2013-04-14
- Subjects: Latent semantic indexing , Data mining , Computer networks -- Security measures , Computer hackers , Computer security
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4575 , http://hdl.handle.net/10962/d1002963 , Latent semantic indexing , Data mining , Computer networks -- Security measures , Computer hackers , Computer security
- Description: In an age of zero-day exploits and increased on-line attacks on computing infrastructure, operational security practitioners are becoming increasingly aware of the value of the information captured in log events. Analysis of these events is critical during incident response, forensic investigations related to network breaches, hacking attacks and data leaks. Such analysis has led to the discipline of Security Event Analysis, also known as Log Analysis. There are several challenges when dealing with events, foremost being the increased volumes at which events are often generated and stored. Furthermore, events are often captured as unstructured data, with very little consistency in the formats or contents of the events. In this environment, security analysts and implementers of Log Management (LM) or Security Information and Event Management (SIEM) systems face the daunting task of identifying, classifying and disambiguating massive volumes of events in order for security analysis and automation to proceed. Latent Semantic Mapping (LSM) is a proven paradigm shown to be an effective method of, among other things, enabling word clustering, document clustering, topic clustering and semantic inference. This research is an investigation into the practical application of LSM in the discipline of Security Event Analysis, showing the value of using LSM to assist practitioners in identifying types of events, classifying events as belonging to certain sources or technologies and disambiguating different events from each other. The culmination of this research presents adaptations to traditional natural language processing techniques that resulted in improved efficacy of LSM when dealing with Security Event Analysis. This research provides strong evidence supporting the wider adoption and use of LSM, as well as further investigation into Security Event Analysis assisted by LSM and other natural language or computer-learning processing techniques. , LaTeX with hyperref package , Adobe Acrobat 9.54 Paper Capture Plug-in
- Full Text:
- Authors: Buys, Stephanus
- Date: 2013 , 2013-04-14
- Subjects: Latent semantic indexing , Data mining , Computer networks -- Security measures , Computer hackers , Computer security
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4575 , http://hdl.handle.net/10962/d1002963 , Latent semantic indexing , Data mining , Computer networks -- Security measures , Computer hackers , Computer security
- Description: In an age of zero-day exploits and increased on-line attacks on computing infrastructure, operational security practitioners are becoming increasingly aware of the value of the information captured in log events. Analysis of these events is critical during incident response, forensic investigations related to network breaches, hacking attacks and data leaks. Such analysis has led to the discipline of Security Event Analysis, also known as Log Analysis. There are several challenges when dealing with events, foremost being the increased volumes at which events are often generated and stored. Furthermore, events are often captured as unstructured data, with very little consistency in the formats or contents of the events. In this environment, security analysts and implementers of Log Management (LM) or Security Information and Event Management (SIEM) systems face the daunting task of identifying, classifying and disambiguating massive volumes of events in order for security analysis and automation to proceed. Latent Semantic Mapping (LSM) is a proven paradigm shown to be an effective method of, among other things, enabling word clustering, document clustering, topic clustering and semantic inference. This research is an investigation into the practical application of LSM in the discipline of Security Event Analysis, showing the value of using LSM to assist practitioners in identifying types of events, classifying events as belonging to certain sources or technologies and disambiguating different events from each other. The culmination of this research presents adaptations to traditional natural language processing techniques that resulted in improved efficacy of LSM when dealing with Security Event Analysis. This research provides strong evidence supporting the wider adoption and use of LSM, as well as further investigation into Security Event Analysis assisted by LSM and other natural language or computer-learning processing techniques. , LaTeX with hyperref package , Adobe Acrobat 9.54 Paper Capture Plug-in
- Full Text:
- «
- ‹
- 1
- ›
- »