A central enrichment-based comparison of two alternative methods of generating transcription factor binding motifs from protein binding microarray data
- Authors: Mahaye, Ntombikayise
- Date: 2013 , 2013-03-13
- Subjects: Transcription factors , Bioinformatics , Protein binding , Protein microarrays , Cell lines
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:3890 , http://hdl.handle.net/10962/d1003049 , Transcription factors , Bioinformatics , Protein binding , Protein microarrays , Cell lines
- Description: Characterising transcription factor binding sites (TFBS) is an important problem in bioinformatics, since predicting binding sites has many applications such as predicting gene regulation. ChIP-seq is a powerful in vivo method for generating genome-wide putative binding regions for transcription factors (TFs). CentriMo is an algorithm that measures central enrichment of a motif and has previously been used as a motif enrichment analysis (MEA) tool. CentriMo uses the fact that ChIP-seq peak calling methods are likely to be biased towards the centre of the putative binding region, at least in cases where there is direct binding. CentriMo calculates a binomial p-value representing central enrichment, based on the central bias of the binding site with the highest likelihood ratio. In cases where binding is indirect or involves cofactors, a more complex distribution of preferred binding sites may occur but, in many cases, a low CentriMo p-value and low width of maximum enrichment (about 100 bp) are strong evidence that the motif in question is the true binding motif. Several other MEA tools have been developed, but they do not consider motif central enrichment. The study investigates the claim made by Zhao and Stormo (2011) that they have identified a simpler method than that used to derive the UniPROBE motif database for creating motifs from protein binding microarray (PBM) data, which they call BEEML-PBM (Binding Energy Estimation by Maximum Likelihood-PBM). To accomplish this, CentriMo is employed on 13 motifs from both motif databases. The results indicate that there is no conclusive difference in the quality of motifs from the original PBM and BEEML-PBM approaches. CentriMo provides an understanding of the mechanisms by which TFs bind to DNA. Out of 13 TFs for which ChIP-seq data is used, BEEML-PBM reports five better motifs, and twice it shows no central enrichment when the best PBM motif does. The PBM approach finds seven motifs with better central enrichment. On the other hand, across all variations, the number of examples where PBM is better is not high enough to conclude that it is overall the better approach. Some TFs bind directly to DNA, some indirectly or in combination with other TFs. Some of the predicted mechanisms are supported by literature evidence. This study further revealed that the binding specificity of a TF is different in different cell types and developmental stages. A TF is up-regulated in a cell line where it performs its biological function. The discovery of cell line differences, which has not been done before in any CentriMo study, is interesting and provides reasons to study this further.
- Full Text:
- Date Issued: 2013
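The binomial test for central enrichment described in the abstract above can be illustrated with a much-simplified sketch. It assumes a null model in which the best-scoring site is equally likely to fall anywhere in the sequence, so the chance of landing in a central window is roughly `window / seq_len`. The function names and the example offsets are hypothetical, and this is not CentriMo's actual implementation.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def central_enrichment_pvalue(best_site_offsets, seq_len, window):
    """Count best sites inside a central window and score the count.

    best_site_offsets: signed distance (bp) of each sequence's best site
    from that sequence's centre (hypothetical input format).
    """
    n = len(best_site_offsets)
    # Number of sequences whose best site falls inside the central window.
    k = sum(1 for d in best_site_offsets if abs(d) <= window / 2)
    # Assumed null: site position is uniform along the sequence.
    p_central = window / seq_len
    return k, binomial_tail(k, n, p_central)


# Illustrative offsets only; not data from the thesis.
offsets = [0, 5, -10, 40, -45, 200, -150, 30, 12, -3]
k, p_value = central_enrichment_pvalue(offsets, seq_len=500, window=100)
```

A small p-value together with a narrow window is the pattern the abstract treats as evidence of direct binding.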
A comparative bioinformatic analysis of zinc binuclear cluster proteins
- Authors: Mthombeni, Jabulani S
- Date: 2005
- Subjects: Bioinformatics , Zinc proteins , GABA
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4004 , http://hdl.handle.net/10962/d1004064 , Bioinformatics , Zinc proteins , GABA
- Description: Members of the zinc binuclear cluster family are important fungal transcriptional regulators sharing a common DNA binding domain. Dal81p is a pleiotropic zinc binuclear cluster protein involved in the induction of the UGA genes required for the γ-aminobutyrate nitrogen catabolic pathway in Saccharomyces cerevisiae. The zinc binuclear cluster domain is indispensable for function in Dal81p and little is known about other domains in this protein. The aim of the study was to explore the zinc binuclear cluster protein family using comparative bioinformatics as a complement to biochemical and structural approaches. A database of all zinc binuclear cluster proteins was composed. A total of 118 zinc binuclear cluster proteins are reported in this work. Thirty-nine previously unidentified zinc binuclear cluster proteins were found. Four homologues of Dal81p were identified by homology searching. Important sequence motifs were identified in the aligned sequences of Dal81p and its homologues. The coiled coil motif found in the Gal4p zinc binuclear cluster protein could not be identified in Dal81p and its homologues. This suggested that Dal81p did not dimerise through this structural motif as other zinc binuclear cluster proteins do. Solvent-accessible sites that could be phosphorylated by protein kinase C or casein kinase II, and the role of such sites in the possible regulation of Dal81p function, were discussed.
- Full Text:
- Date Issued: 2005
Identification of cis-elements and transacting factors involved in the abiotic stress responses of plants
- Authors: Maclear, Athlee
- Date: 2005 , 2013-06-10
- Subjects: Plants -- Effect of stress on , Proteins -- Analysis , Bioinformatics , DNA , Plant genetics
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4074 , http://hdl.handle.net/10962/d1007236 , Plants -- Effect of stress on , Proteins -- Analysis , Bioinformatics , DNA , Plant genetics
- Description: Many stress situations limit plant growth, resulting in crop production difficulties. Population growth, limited availability and over-utilization of arable land, and intolerant crop species have resulted in tremendous strain being placed on agriculturalists to produce enough to sustain the world's population. An understanding of the principles involved in plant resistance to environmental stress will enable scientists to harness these mechanisms to create stress-tolerant crop species, thus increasing crop production, and enabling the farming of previously unproductive land. This research project uses computational and bioinformatics techniques to explore the promoter regions of genes encoding proteins that are up- or down-regulated in response to specific abiotic stresses, with the aim of identifying common patterns in the cis-elements governing the regulation of these abiotic stress responsive genes. An initial dataset of fifty known genes encoding proteins reported to be up- or down-regulated in response to plant stresses that result in water-deficit at the cellular level, viz. drought, low temperature, and salinity, was identified, and a PostgreSQL database created to store relevant information pertaining to these genes and the proteins encoded by them. The genomic DNA was obtained where possible, and the promoter and intron regions identified. The Neural Network Promoter Prediction (NNPP) software package was used to predict the transcription start signal (TSS), and the promoter searching software tool TESS (Transcription Element Search Software) was used to identify known and user-defined cis-elements within the promoter regions of these genes. Currently available promoter prediction software analysis tools are reported to predict one promoter per kilobase of DNA, whilst functional promoters are thought to occur only once in 30-40 kilobases, which indicates that a large percentage of predictions are likely to be false positives (Pedersen et al., 1999). NNPP was chosen as it was rated as the highest performing promoter prediction software tool by Fickett and Hatzigeorgiou (1997) in a thorough review of eukaryotic promoter prediction algorithms; however, results were less than promising, as very few predicted TSS were identified in the area 50 bp up- and downstream of the gene start site, where biologically functional TSSs are known to occur (Reese, 2000; Fickett and Hatzigeorgiou, 1997). TESS results seemed to support the hypothesis that drought, low-temperature and high salinity plant stress response proteins have similar cis-elements in their promoter regions, and suggested links to various other gene regulation mechanisms, viz. gibberellin-, light-, auxin- and development-regulated gene expression, highlighting the vast complexity of plant stress response processes. Although far from conclusive, results provide a valuable basis for future comparative promoter studies that will attempt to deduce possible common transcriptional initiation of abiotic stress response genes.
- Full Text:
- Date Issued: 2005
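The false-positive argument in the abstract above is simple arithmetic, sketched here under a best-case assumption that every functional promoter is among the predictions. The function name is hypothetical; the rates are the ones quoted in the abstract (one prediction per kilobase, one functional promoter per 30-40 kilobases).

```python
def min_false_positive_fraction(predicted_per_kb, functional_per_kb):
    """Lower bound on the fraction of predictions that are false positives.

    Assumes (optimistically) that every functional promoter is predicted,
    so at most functional_per_kb of the predictions per kb can be correct.
    """
    return 1 - functional_per_kb / predicted_per_kb


# One predicted promoter per kb versus one functional promoter per 30 or 40 kb.
fp_at_30kb = min_false_positive_fraction(1.0, 1 / 30)
fp_at_40kb = min_false_positive_fraction(1.0, 1 / 40)
```

Under these assumptions roughly 97% or more of the predictions would be false positives, which is consistent with the abstract's "large percentage".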
Stress-inducible protein 1: a bioinformatic analysis of the human, mouse and yeast STI1 gene structure
- Authors: Aken, Bronwen Louise
- Date: 2005
- Subjects: Molecular chaperones , Proteins -- Analysis , Heat shock proteins , Bioinformatics , Genetics -- Data processing
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:3990 , http://hdl.handle.net/10962/d1004049 , Molecular chaperones , Proteins -- Analysis , Heat shock proteins , Bioinformatics , Genetics -- Data processing
- Description: Stress-inducible protein 1 (Sti1) is a 60 kDa eukaryotic protein that is important under stress and non-stress conditions. Human Sti1 is also known as the Hsp70/Hsp90 organising protein (Hop) that coordinates the functional cooperation of heat shock protein 70 (Hsp70) and heat shock protein 90 (Hsp90) during the folding of various transcription factors and kinases, including certain oncogenic proteins and prion proteins. Limited studies have been conducted on the STI1 gene structure. Thus, the aim of this study was to develop a comprehensive description of human STI1 (hSTI1), mouse STI1 (mSTI1), and yeast STI1 (ySTI1) genes, using a bioinformatic approach. Genes encoded near the STI1 loci were identified for the three organisms using National Center for Biotechnology Information (NCBI) MapViewer and the Saccharomyces Genome Database. Exon/intron boundaries were predicted using Hidden Markov model gene prediction software (HMMGene) and Genscan, and by alignment of the mRNA sequence with the genomic DNA sequence. Transcription factor binding sites (TFBS) were predicted by scanning the region 1000 base pairs (bp) upstream of the STI1 orthologues’ transcription start site (TSS) with Alibaba, Transcription element search software (TESS) and Transcription factor search (TFSearch). The promoter region was defined by comparing the number, type and position of TFBS across the orthologous STI1 genes. Additional putative TFBS were identified for ySTI1 by searching with software that aligns nucleic acid conserved elements (AlignACE) for over-represented motifs in the region upstream of the TSS of genes thought to be co-regulated with ySTI1. This study showed that hSTI1 and mSTI1 occur in a region of synteny with a number of genes of related function. Both hSTI1 and mSTI1 comprised 14 putative exons, while ySTI1 was encoded on a single exon. Human and mouse STI1 shared a perfectly conserved 55 bp region spanning their predicted TSS, although their TATA boxes were not conserved. A putative CpG island was identified in the region from -500 to +100 bp relative to the hSTI1 and mSTI1 TSS. This region overlapped with a region of high TFBS density, suggesting that the core promoter region was located in the region approximately 100 to 200 bp upstream of the TSS. Several conserved clusters of TFBS were also identified upstream of this promoter region, including binding sites for stimulatory protein 1 (Sp1), heat shock factor (HSF), nuclear factor kappa B (NF-kappaB), and the CCAAT/enhancer binding protein (C/EBP). Microarray data suggested that ySTI1 was co-regulated with several heat shock proteins and substrates of the Hsp70/Hsp90 heterocomplex, and several putative regulatory elements were identified in the upstream region of these co-regulated genes, including a motif for HSF binding. The results of this research suggest several avenues of future experimental work, including the confirmation of the proposed core promoter, upstream regulatory elements, and CpG island, and the investigation into the co-regulation of mammalian STI1 with its surrounding genes. These results could also be used to inform STI1 gene knockout experiments in mice, to assess the biological importance of mammalian STI1.
- Full Text:
- Date Issued: 2005
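The abstract above does not say how the putative CpG island was detected. As an illustration only, the sketch below applies the commonly used Gardiner-Garden and Frommer style criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). The function names and thresholds as defaults are assumptions, not the thesis's method.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three assumed thresholds to a candidate region."""
    return (
        len(seq) >= min_len
        and gc_fraction(seq) > min_gc
        and cpg_obs_exp(seq) > min_ratio
    )
```

In practice such a check would be run over a sliding window across the region of interest, for example the -500 to +100 bp span mentioned in the abstract.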
The role of parallel computing in bioinformatics
- Authors: Akhurst, Timothy John
- Date: 2005
- Subjects: Bioinformatics , Parallel programming (Computer science) , LINDA (Computer system) , Java (Computer program language) , Parallel processing (Electronic computers) , Genomics -- Data processing
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:3986 , http://hdl.handle.net/10962/d1004045 , Bioinformatics , Parallel programming (Computer science) , LINDA (Computer system) , Java (Computer program language) , Parallel processing (Electronic computers) , Genomics -- Data processing
- Description: The need to intelligibly capture, manage and analyse the ever-increasing amount of publicly available genomic data is one of the challenges facing bioinformaticians today. Such analyses are in fact impractical using uniprocessor machines, which has led to an increasing reliance on clusters of commodity-priced computers. An existing network of cheap, commodity PCs was utilised as a single computational resource for parallel computing. The performance of the cluster was investigated using a whole genome-scanning program written in the Java programming language. The TSpaces framework, based on the Linda parallel programming model, was used to parallelise the application. Maximum speedup was achieved at between 30 and 50 processors, depending on the size of the genome being scanned. Together with this, the associated significant reductions in wall-clock time suggest that both parallel computing and Java have a significant role to play in the field of bioinformatics.
- Full Text:
- Date Issued: 2005
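The speedup measurement mentioned in the abstract above follows the standard definitions: speedup is serial wall-clock time divided by parallel wall-clock time, and efficiency is speedup divided by the number of processors. The sketch below uses hypothetical function names, and its timings are placeholders invented for illustration, not results from the thesis.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup S = T_serial / T_parallel."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, processors):
    """Parallel efficiency E = S / p."""
    return speedup(serial_seconds, parallel_seconds) / processors


def best_processor_count(serial_seconds, timings):
    """Return the processor count with the highest speedup.

    timings: dict mapping processor count -> wall-clock seconds.
    """
    return max(timings, key=lambda p: speedup(serial_seconds, timings[p]))


# Placeholder timings invented for illustration only; not from the thesis.
example_timings = {10: 130.0, 20: 75.0, 40: 50.0, 60: 55.0}
best = best_processor_count(1000.0, example_timings)
```

With numbers shaped like these, speedup peaks at an intermediate processor count and then falls as coordination overhead grows, which is the kind of behaviour the abstract reports between 30 and 50 processors.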