Transcription factor binding specificity and occupancy : elucidation, modelling and evaluation
- Authors: Kibet, Caleb Kipkurui
- Date: 2017
- Subjects: Transcription factors , Transcription factors -- Data processing , Motif Assessment and Ranking Suite
- Language: English
- Type: Thesis , Doctoral , PhD
- Identifier: vital:21185 , http://hdl.handle.net/10962/6838
- Description: The major contributions of this thesis are addressing the need for an objective quality evaluation of a transcription factor binding model, demonstrating the value of the tools developed to this end and elucidating how in vitro and in vivo information can be utilized to improve TF binding specificity models. Accurate elucidation of TF binding specificity remains an ongoing challenge in gene regulatory research. Several in vitro and in vivo experimental techniques have been developed, followed by a proliferation of algorithms and, ultimately, of binding models. This increase led to a choice problem for the end users: which tools to use, and which is the most accurate model for a given TF? Therefore, the first section of this thesis investigates the motif assessment problem: how scoring functions, choice and processing of benchmark data, and statistics used in evaluation affect motif ranking. This analysis revealed that TF motif quality assessment requires a systematic comparative analysis, and that the scoring functions used have a TF-specific effect on motif ranking. These results informed the design of a Motif Assessment and Ranking Suite (MARS), supported by PBM and ChIP-seq benchmark data and an extensive collection of PWM motifs. MARS implements consistency, enrichment, and scoring and classification-based motif evaluation algorithms. Transcription factor binding is also influenced and determined by contextual factors: chromatin accessibility, competition or cooperation with other TFs, cell line or condition specificity, binding locality (e.g. proximity to transcription start sites) and the shape of the binding site (DNA-shape). In vitro techniques do not capture such context; therefore, this thesis also combines PBM and DNase-seq data using a comparative k-mer enrichment approach that compares open chromatin with genome-wide prevalence, achieving a modest performance improvement when benchmarked on ChIP-seq data.
Finally, since statistical and probabilistic methods cannot capture all the information that determines binding, a machine learning approach (XGBoost) was implemented to investigate how the features contribute to TF specificity and occupancy. This combinatorial approach improves the predictive ability of TF specificity models, with chromatin accessibility being the most predictive feature, while DNA-shape and conservation information both significantly improve on the baseline model of k-mer and DNase data. The results and the tools introduced in this thesis are useful for systematic comparative analysis (via MARS) and for a combinatorial approach to modelling TF binding specificity, including appropriate feature engineering practices for machine learning modelling.
- Date Issued: 2017
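The comparative k-mer enrichment idea mentioned in the description (k-mer prevalence in open chromatin versus genome-wide prevalence) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `kmer_counts` and `kmer_enrichment`, the log2-ratio score, the pseudocount and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_enrichment(open_seqs, background_seqs, k=6, pseudocount=1.0):
    """Log2 ratio of k-mer frequency in open chromatin versus background.

    Assumed scoring scheme (hypothetical): a positive score means the k-mer
    is relatively more frequent in the open-chromatin set.
    """
    fg = kmer_counts(open_seqs, k)
    bg = kmer_counts(background_seqs, k)
    kmers = set(fg) | set(bg)
    # The pseudocount is added once per observed k-mer in both totals.
    fg_total = sum(fg.values()) + pseudocount * len(kmers)
    bg_total = sum(bg.values()) + pseudocount * len(kmers)
    return {
        kmer: log2(((fg[kmer] + pseudocount) / fg_total)
                   / ((bg[kmer] + pseudocount) / bg_total))
        for kmer in kmers
    }


# Toy, made-up sequences purely for illustration.
open_regions = ["TTGATAAGGA", "CCGATAAGTT"]
background = ["ACGTACGTAC", "TTTTGGGGCC", "GATAAGACGT"]
scores = kmer_enrichment(open_regions, background, k=6)
```

In this toy input, `scores["GATAAG"]` is positive because the k-mer is relatively more frequent in the open regions, while a background-only k-mer such as `ACGTAC` scores negative. In practice such scores would be computed from DNase-seq accessible regions and a genome-wide background rather than from hand-written strings.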
Analysis of transcription factor binding specificity using ChIP-seq data.
- Authors: Kibet, Caleb Kipkurui
- Date: 2014
- Subjects: Transcription factors , Chronic myeloid leukemia , Antioncogenes , Cancer cells -- Growth -- Regulation
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4115 , http://hdl.handle.net/10962/d1013131
- Description: Transcription factors (TFs) are key regulators of gene expression whose failure has been implicated in many diseases, including cancer. They bind at various sites with different specificity depending on the prevailing cellular conditions, disease, developmental stage or environmental conditions of the cell. TF binding specificity is how well a TF distinguishes functional sites from potential non-functional sites to form a useful regulatory network. Owing to its role in disease, various techniques have been used to determine TF binding specificity in vitro and in vivo, including chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq). ChIP-seq is an in vivo technique that considers how the chromatin landscape affects TF binding. Motif enrichment analysis (MEA) tools are used to identify motifs that are over-represented in ChIP-seq peak regions. One such tool, CentriMo, finds motifs that are over-represented at the center of peak regions, since peak-calling software is biased toward declaring binding regions centered on the TF binding site. In this study, we investigate the use of CentriMo and other MEA tools to determine the difference in motif enrichment attributed to the presence of chronic myeloid leukemia (CML) and to treatment with interferon (IFN) and dexamethasone (DEX), compared to control, based on Fisher’s exact test and using uniform-peaks ChIP-seq data generated by the ENCODE consortium. CentriMo proved capable of detecting such differences. We observed differential motif enrichment of TFs with tumor promoter activity (YY1, CEBPA, Egr1, the Cmyc family, Gata1 and JunD) in K562, and of Stat1, Irf1 and Runx1 in Gm12878. Enrichment of CTCF in Gm12878 with YY1 as the immunoprecipitated (ChIP-ed) factor, and the presence of significant spacing (SpaMo analysis) between CTCF and YY1 in Gm12878 but not in K562, could show that CTCF, as a repressor, helps in maintaining the required YY1 level in a normal cell line.
IFN might reduce the binding of Cmyc and the Jun family of TFs via the repressive action of CTCF and E2f2. We also show that the concentration of DEX treatment affects motif enrichment, with 50 nM being an optimum concentration for Gr binding by maintaining open chromatin via the AP1 TF. This study has demonstrated the usefulness of CentriMo for TF binding specificity analysis.
- Date Issued: 2014
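The description states that differences in motif enrichment between conditions were assessed with Fisher’s exact test. The following is a minimal, self-contained sketch of a one-sided Fisher’s exact test on a 2x2 table of sequences with and without a motif match. It is not the procedure used by CentriMo or by the thesis; the function name `fisher_exact_greater` and the counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical): row 1 is the condition of interest,
    row 2 is the control; column 1 counts sequences with a motif match,
    column 2 counts sequences without one.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # sequences in the condition of interest
    col1 = a + c          # sequences with a motif match, both conditions
    n = a + b + c + d     # all sequences
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / denom


# Made-up counts: 30 of 100 treated peaks and 10 of 100 control peaks
# contain the motif.
p_value = fisher_exact_greater(30, 70, 10, 90)
```

A small p-value here would suggest the motif is matched more often in the condition of interest than in the control. Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for large tables a statistics library would be the more practical choice.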