Application of machine learning, molecular modelling and structural data mining against antiretroviral drug resistance in HIV-1
- Authors: Sheik Amamuddy, Olivier Serge André
- Date: 2020
- Subjects: Machine learning , Molecules -- Models , Data mining , Neural networks (Computer science) , Antiretroviral agents , Protease inhibitors , Drug resistance , Multidrug resistance , Molecular dynamics , Renin-angiotensin system , HIV (Viruses) -- South Africa , HIV (Viruses) -- Social aspects -- South Africa , South African Natural Compounds Database
- Language: English
- Type: text , Thesis , Doctoral , PhD
- Identifier: http://hdl.handle.net/10962/115964 , vital:34282
- Description: Millions are affected by the Human Immunodeficiency Virus (HIV) worldwide, even though the death toll is on the decline. Antiretrovirals (ARVs), more specifically protease inhibitors, have shown tremendous success since their introduction into therapy in the mid-1990s by slowing down progression to the Acquired Immune Deficiency Syndrome (AIDS). However, Drug Resistance Mutations (DRMs) are constantly selected for due to viral adaptation, making drugs less effective over time. The current challenge is to manage the infection optimally with a limited set of drugs, with differing levels of associated toxicity, in the face of a virus that (1) exists as a quasispecies, (2) may transmit acquired DRMs to drug-naive individuals and (3) can manifest class-wide resistance due to similarities in drug design. The presence of latent reservoirs, unawareness of infection status, education and various socio-economic factors make the problem even more complex. Adequate timing and choice of drug prescription together with treatment adherence are very important, as drug toxicities, drug failure and sub-optimal treatment regimens leave room for further development of drug resistance. While CD4 cell count and the determination of viral load from patients in resource-limited settings are very helpful to track how well a patient’s immune system is able to keep the virus in check, they can be slow to show whether an ARV is effective. PhenoSense assay kits address this problem by using viruses engineered to contain the patient sequences and evaluating their growth in the presence of different ARVs, but this can be expensive and too involved for routine checks. As a cheaper and faster alternative, genotypic assays provide similar information from HIV pol sequences obtained from blood samples, inferring ARV efficacy on the basis of drug resistance mutation patterns.
However, these patterns are inherently complex and the various methods of in silico prediction, such as Geno2pheno, REGA and Stanford HIVdb, do not always agree, even though this gap decreases as the list of resistance mutations is updated. A major gap in HIV treatment is that the information used for predicting drug resistance is mainly computed from data containing an overwhelming majority of subtype B HIV, which accounts for only about 12% of worldwide HIV infections. In addition to growing evidence that drug resistance is subtype-related, it is intuitive to hypothesize that, as subtyping is a phylogenetic classification, the more divergent a subtype is from the strains used in training prediction models, the less their resistance profiles would correlate. For the aforementioned reasons, we used a multi-faceted approach to attack the virus in multiple ways. This research aimed to (1) improve resistance prediction methods by focusing solely on the available subtype, (2) mine structural information pertaining to resistance in order to find any exploitable weak points and increase knowledge of the mechanistic processes of drug resistance in HIV protease, and (3) screen for protease inhibitors in a database of natural compounds, the South African Natural Compounds Database (SANCDB), to find molecules or molecular properties that could be used to improve inhibition of the drug target. In this work, structural information was mined using the Anisotropic Network Model, Dynamic Cross-Correlation, Perturbation Response Scanning, residue contact network analysis and the radius of gyration. These methods failed to give any resistance-associated patterns in terms of natural movement, internal correlated motions, residue perturbation response, relational behaviour and global compaction, respectively.
Applications of drug docking, homology modelling and energy minimization for generating features suitable for machine learning were not very promising, and rather suggest that binding energies from Vina may not, by themselves, be quantitatively reliable. All these failures led to a refinement that resulted in a highly sensitive, statistically guided network construction and analysis, which led to key findings in the early dynamics associated with resistance across all PI drugs. The latter experiment uncovered a conserved lateral expansion motion occurring at the flap elbows, and an associated contraction that drives the base of the dimerization domain towards the catalytic site’s floor in the case of drug resistance. Interestingly, we found that despite the conserved movement, bond angles were degenerate. Alongside this, 16 Artificial Neural Network models were optimised for HIV protease and reverse transcriptase inhibitors, with performances on par with Stanford HIVdb. Finally, we prioritised 9 compounds with potential protease inhibitory activity using virtual screening and molecular dynamics (MD), and additionally suggested a promising modification to one of the compounds. This yielded another molecule that inhibited both open and closed receptor target conformations equally well, whereby each of the compounds had been selected against an array of multi-drug-resistant receptor variants. While a main hurdle was a lack of non-B subtype data, our findings, especially from the statistically guided network analysis, may extrapolate to non-B subtypes to a certain extent, as the level of conservation was very high within subtype B despite all the variations present. This network construction method lays down a sensitive approach for analysing a pair of alternate phenotypes for which complex patterns prevail, given a sufficient number of experimental units.
During the course of the research, a weighted contact mapping tool was developed to compare renin-angiotensinogen variants and packaged as part of the MD-TASK tool suite. Finally, the functionality, compatibility and performance of the MODE-TASK tool were evaluated and confirmed for both Python 2.7.x and Python 3.x, for the analysis of normal modes from single protein structures and essential modes from MD trajectories. These techniques and tools collectively add to the conventional means of MD analysis.
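As an illustration of the Dynamic Cross-Correlation analysis named in the description, the following is a minimal sketch. It assumes the commonly used normalised form, C_ij = <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>), computed over trajectory frames that have already been superimposed. The function name `dynamic_cross_correlation`, the input layout and the toy trajectory are hypothetical; none of them is taken from the thesis or from the MD-TASK code.

```python
import math


def dynamic_cross_correlation(trajectory):
    """Return the N x N correlation matrix for a pre-aligned trajectory.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
    one per atom (hypothetical layout chosen for this sketch).
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])

    # Mean position of every atom over all frames.
    mean = [
        [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        for i in range(n_atoms)
    ]

    # Per-frame displacement of every atom from its mean position.
    disp = [
        [[frame[i][k] - mean[i][k] for k in range(3)] for i in range(n_atoms)]
        for frame in trajectory
    ]

    def covariance(i, j):
        # Time average of the dot product of the two displacement vectors.
        return sum(
            sum(d[i][k] * d[j][k] for k in range(3)) for d in disp
        ) / n_frames

    variance = [covariance(i, i) for i in range(n_atoms)]

    dcc = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            norm = math.sqrt(variance[i] * variance[j])
            # An atom that never moves has no defined correlation; report 0.
            dcc[i][j] = covariance(i, j) / norm if norm > 0 else 0.0
    return dcc


# Toy trajectory: atoms 0 and 1 move together along x, atom 2 moves the other way.
toy = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
    [(1.0, 0.0, 0.0), (6.0, 0.0, 0.0), (-1.0, 3.0, 0.0)],
    [(2.0, 0.0, 0.0), (7.0, 0.0, 0.0), (-2.0, 3.0, 0.0)],
]
dcc = dynamic_cross_correlation(toy)
print(dcc[0][1], dcc[0][2])  # close to +1 (correlated) and -1 (anti-correlated)
```

Values near +1 indicate atoms that move in the same direction, values near -1 indicate opposite motion, and values near 0 indicate uncorrelated motion. A real analysis would read aligned coordinates from an MD trajectory rather than a hand-written list.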
- Date Issued: 2020
Mechanism of action of non-synonymous single nucleotide variations associated with α-carbonic anhydrases II, IV and VIII
- Authors: Sanyanga, T. Allan
- Date: 2020
- Subjects: Carbonic anhydrase , Carbonic anhydrase -- Therapeutic use , Nucleotides
- Language: English
- Type: text , Thesis , Doctoral , PhD
- Identifier: http://hdl.handle.net/10962/167346 , vital:41470
- Description: The carbonic anhydrase (CA) group of enzymes are zinc (Zn2+) metalloproteins responsible for the reversible hydration of CO2 to bicarbonate (BCT or HCO3−) and protons (H+) for the facilitation of acid-base balance and homeostasis within the body. Across all organisms, at least six CA families exist: α (alpha), β (beta), γ (gamma), δ (delta), η (eta) and ζ (zeta). Some organisms can have more than one family, with the exception of humans, who have the α family only. The α-CA family comprises 16 isoforms (CA-I to CA-XV), including the acatalytic isoforms CA-VIII, CA-X and CA-XI. Of the catalytic isoforms, CA-II and CA-IV have some of the fastest rates of reaction, and any disturbances to the function of these enzymes result in CA deficiencies and undesirable phenotypes. CA-II deficiencies result in osteopetrosis with renal tubular acidosis and cerebral calcification, whereas CA-IV deficiencies result in retinitis pigmentosa 17 (RP17). Phenotypic effects generally manifest as a result of poor protein folding and function due to the presence of non-synonymous single nucleotide variations (nsSNVs). Even within the acatalytic isoforms such as CA-VIII, which allosterically regulates the affinity of inositol trisphosphate (IP3) for the IP3 receptor type 1 (ITPR1) and regulates calcium (Ca2+) signalling, the presence of SNVs causes the phenotype cerebellar ataxia, mental retardation, and dysequilibrium syndrome 3 (CAMRQ3). Currently the majority of research into the CAs is focused on the inhibition of these proteins to achieve therapeutic effects in patients via the control of HCO3− production or reabsorption, as observed in glaucoma and diuretic medications. Little research has therefore been devoted to the identification of stabilising or activating compounds that could rescue protein function in the case of deficiencies.
The main aim of this research was to identify and characterise the effects of nsSNVs on the structure and function of CA-II, CA-IV and CA-VIII to set a foundation for rare disease studies into the CA group of proteins. Combined bioinformatics approaches divided into four main objectives were implemented. These included variant identification, sequence analysis and protein characterisation, force field (FF) parameter generation, molecular dynamics (MD) simulation and dynamic residue network (DRN) analysis. Six variants for each of the CA-II, CA-IV and CA-VIII proteins with pathogenicity annotations were identified from the HUMA and Ensembl databases. These included the pathogenic variants K18E, K18Q, H107Y, P236H, P236R and N252D for CA-II. CA-IV included the pathogenic R69H, R219C and R219S, and benign N86K, N177K and V234I variants. CA-VIII included pathogenic S100A, S100P, G162R and R237Q, and benign S100L and E109D variants. CA-II has been more extensively studied than CA-IV and CA-VIII, therefore residues essential to its function and stability are known. To discover important residues and regions within the CA-IV and CA-VIII proteins, sequence and motif analysis was performed across the α-CA family, using CA-II as a reference. Sequence analysis identified multiple conserved residues between the two catalytic isoforms, CA-II and CA-IV, and the acatalytic CA-VIII isoform that were proposed to be essential for protein stability. With the exception of the benign N86K CA-IV variant, none of the other pathogenic or benign CA-II, CA-IV and CA-VIII SNVs were located at functionally or structurally important residues. Motif analysis identified 11 conserved and important motifs within the α-CA family. Several of the identified variants were located on these motifs, including K18E, K18Q, H107Y and N252D (CA-II); N86K, R219C, R219S and V234I (CA-IV); and E109D, G162R and R237Q (CA-VIII).
As there were no X-ray crystal structures of the variant proteins, homology modelling was performed to model the protein structures for characterisation. In CA-VIII, the substitution of Ser for Pro at position 100 (variant S100P) resulted in destruction of the β-sheet that the SNV was located on. Little is known about the mechanism of interaction between CA-VIII and ITPR1, or the residues involved. SiteMap and CPORT were used to identify binding site amino acids for CA-VIII, and results identified 38 potential residues. Traditional FFs are incapable of performing MD simulations of metalloproteins. The AMBER ff14SB FF was extended and Zn2+ FF parameters calculated to add support for metalloprotein MD simulations. In the protein, Zn2+ was noted to have a charge less than +1. Variant effects on protein structure were then investigated using MD simulations. Root mean square deviation (RMSD) and radius of gyration (Rg) results indicated subtle SNV effects on the variant global structure in CA-II and CA-IV. However, for CA-VIII, RMSD analysis highlighted that variant presence was associated with increases in the structural rigidity of the protein. Principal component analysis (PCA) in conjunction with free energy analysis was performed to observe variant effects on protein conformational sampling in 3D space. The binding of BCT to CA-II induced greater protein conformational sampling and was associated with higher free energy. In CA-IV and CA-VIII, PCA revealed key differences in the mechanism of action of pathogenic and benign SNVs. In CA-IV, wild-type (WT) and benign variant protein structures clustered into a single low-energy well, hinting at the presence of more stable structures. Pathogenic variants were associated with higher free energy, and the proteins sampled more conformations without settling into a low-energy well. PCA of CA-VIII indicated the opposite of CA-IV.
Pathogenic variants were clustered into low-energy wells, while the WT and benign variants showed greater conformational sampling. Dynamic cross correlation (DCC) analysis was performed using the MD-TASK suite to determine variant effects on residue movement. The CA-II WT protein revealed that BCT and CO2 were associated with anti-correlated and correlated residue movement, respectively, hinting at opposite mechanisms. In CA-IV and CA-VIII, variant presence resulted in a change to residue correlation compared to the WT proteins. DRN analysis was performed to investigate SNV effects on residue accessibility and communication. Results demonstrated that SNVs are associated with allosteric effects on the CA protein structures, and that these effects are located on the stability-assisting residues of the aromatic clusters and the active site of the proteins. CA-II studies discovered that Glu117 is the most important residue for communication, and variant presence results in a decrease in the usage of the residue. This effect was greatest in the CA-II H107Y SNV, and suggests that variants could have an effect on Zn2+ dissociation from the active site. Decreases in the usage of Zn2+ coordinating residues were also noted. Where this occurred, compensatory increases in the usage of other primary and secondary coordination residues were observed, which could possibly assist with the maintenance of Zn2+ within the active site. The CA-IV variants R69H and R219C highlighted potentially similar pathogenic mechanisms, whereas N86K and N177K hinted at potentially similar benign mechanisms. Within CA-VIII, variant presence was associated with changes to the accessibility of the N-terminal binding site residues. The benign CA-VIII variants highlighted possible compensatory mechanisms, whereby as one group of N-terminal residues lost accessibility, there was an increase in the accessibility of other binding site residues to possibly balance the effect.
Catalytically, the proton shuttle residue His64 in CA-II was found to occupy a novel conformation, named the “faux in”, that brought the imidazole group even closer to the Zn2+ than the “in” conformation. Overall, compared to traditional MD simulations, the incorporation of DRN allowed more detailed investigations into the variant mechanisms of action. This highlights the importance of network analysis in the study of the effects of missense mutations on the structure and function of proteins. Investigation of diseases at the molecular level is essential in the identification of disease pathogenesis and assists with the development of specifically tailored and better treatment options, especially in the cases of genetically associated rare diseases.
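The residue "usage" reported by the DRN analysis can be illustrated with a small sketch. It assumes that usage corresponds to betweenness centrality in a residue contact network, which the description does not state explicitly. The 6.7 Å cutoff, the function names and the toy coordinates are likewise assumptions made for illustration, and the sketch handles a single structure, whereas a DRN analysis would average over MD frames.

```python
import math
from collections import deque


def contact_network(coords, cutoff=6.7):
    """Link residues whose representative atoms lie within `cutoff` angstroms.

    coords: list of (x, y, z) tuples, one per residue (hypothetical input).
    The cutoff value is an assumption, not a figure from the description.
    """
    adj = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def betweenness_centrality(adj):
    """Unnormalised betweenness for an unweighted, undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies from the farthest nodes inwards.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every undirected pair was visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


# Toy "protein": five residues in a straight line, 5 angstroms apart.
toy_coords = [(5.0 * i, 0.0, 0.0) for i in range(5)]
usage = betweenness_centrality(contact_network(toy_coords))
print(usage)
```

In the toy chain the central residue lies on the most shortest paths and therefore has the highest value. Comparing such values between a wild-type and a variant structure is one way to express a change in how heavily a residue is used for communication.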
- Date Issued: 2020