Statistical and Mathematical Learning: an application to fraud detection and prevention
- Authors: Hamlomo, Sisipho
- Date: 2022-04-06
- Subjects: Credit card fraud , Bootstrap (Statistics) , Support vector machines , Neural networks (Computer science) , Decision trees , Machine learning , Cross-validation , Imbalanced data
- Language: English
- Type: Master's thesis , text
- Identifier: http://hdl.handle.net/10962/233795 , vital:50128
- Description: Credit card fraud is an ever-growing problem. There has been a rapid increase in the rate of fraudulent activities in recent years, resulting in considerable losses to several organizations, companies and government agencies. Many researchers have focused on detecting fraudulent behaviours early using advanced machine learning techniques. However, credit card fraud detection is not a straightforward task, since fraudulent behaviours usually differ for each attempt and the dataset is highly imbalanced, that is, non-fraudulent cases far outnumber fraudulent cases. In the case of the European credit card dataset, the ratio is approximately one fraudulent case to five hundred and seventy-eight non-fraudulent cases. Different methods were implemented to overcome this problem, namely random undersampling, one-sided sampling, SMOTE combined with Tomek links and parameter tuning. Predictive classifiers, namely logistic regression, decision trees, k-nearest neighbours, support vector machines and multilayer perceptrons, were applied to predict whether a transaction is fraudulent or non-fraudulent. Model performance was evaluated using recall, precision, F1-score, the area under the receiver operating characteristic curve, the geometric mean and the Matthews correlation coefficient. The results showed that the logistic regression classifier performed better than the other classifiers except when the dataset was oversampled. (An illustrative resampling-and-evaluation sketch follows this record.) , Thesis (MSc) -- Faculty of Science, Statistics, 2022
- Full Text:
- Date Issued: 2022-04-06
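The resampling-and-evaluation pipeline described in the abstract above can be illustrated with a short, self-contained sketch. This is not the thesis code: it uses scikit-learn and imbalanced-learn on synthetic data with roughly the 1:578 class ratio mentioned in the abstract, applies SMOTE combined with Tomek links, and scores a logistic regression with the same metrics (precision, recall, F1, ROC AUC, geometric mean and Matthews correlation coefficient).

```python
# Minimal sketch (not the thesis code): SMOTE + Tomek links resampling, then
# logistic regression scored with the metrics used in the study.
# Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, matthews_corrcoef)
from imblearn.combine import SMOTETomek
from imblearn.metrics import geometric_mean_score

# Synthetic stand-in for the European credit card data:
# roughly 1 fraudulent case per 578 non-fraudulent cases.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[578 / 579], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample the minority class with SMOTE, then clean boundary pairs with Tomek links.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("precision", precision_score(y_test, y_pred))
print("recall   ", recall_score(y_test, y_pred))
print("F1       ", f1_score(y_test, y_pred))
print("ROC AUC  ", roc_auc_score(y_test, y_prob))
print("G-mean   ", geometric_mean_score(y_test, y_pred))
print("MCC      ", matthews_corrcoef(y_test, y_pred))
```

Random undersampling, one of the other methods named in the abstract, could be swapped in via imbalanced-learn's RandomUnderSampler, which exposes the same fit_resample interface.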
A modelling approach to the analysis of complex survey data
- Authors: Dlangamandla, Olwethu
- Date: 2021-10-29
- Subjects: Sampling (Statistics) , Linear models (Statistics) , Multilevel models (Statistics) , Logistic regression analysis , Complex survey data
- Language: English
- Type: Master's thesis , text
- Identifier: http://hdl.handle.net/10962/192955 , vital:45284
- Description: Surveys are an essential tool for collecting data, and most surveys use complex sampling designs to collect the data. Complex sampling designs are used mainly to enhance representativeness in the sample by accounting for the underlying structure of the population. This often results in data that are non-independent and clustered. Ignoring complex design features such as clustering, stratification, multistage sampling and unequal probability sampling may result in inaccurate and incorrect inference. An overview of, and the differences between, design-based and model-based approaches to inference for complex survey data are discussed. This study adopts a model-based approach. The objective of this study is to discuss and describe the modelling approach to analysing complex survey data. This is done specifically by introducing the principal inference methods under which data from complex surveys may be analysed. In particular, discussions on the theory and methods of model fitting for the analysis of complex survey data are presented. We begin by discussing the unique features of complex survey data and explore appropriate methods of analysis that account for the complexity inherent in the survey data. We also explore the widely applied logistic regression modelling of binary data in a complex sample survey context. In particular, four forms of logistic regression models are fitted: generalized linear models, multilevel models, mixed effects models and generalized linear mixed models. Simulated complex survey data are used to illustrate the methods and models, and various R packages are used for the analysis. The results presented and discussed in this thesis indicate that a logistic mixed model with first and second level predictors has a better fit than a logistic mixed model with first level predictors only. In addition, a logistic multilevel model with first and second level predictors and nested random effects provides a better fit to the data than the other logistic multilevel models fitted. Similar results were obtained from fitting a generalized logistic mixed model with first and second level predictor variables and a generalized linear mixed model with first and second level predictors and nested random effects. (An illustrative sketch of fitting a clustered logistic model follows this record.) , Thesis (MSc) -- Faculty of Science, Statistics, 2021
- Full Text:
- Date Issued: 2021-10-29
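As a rough illustration of the model-based approach described in the abstract above, the sketch below simulates two-level clustered binary data and contrasts an ordinary logistic GLM, which ignores the clustering, with a random-intercept logistic mixed model. The thesis itself uses R packages; this is only a Python analogue (statsmodels' variational-Bayes BinomialBayesMixedGLM), and the variables, cluster structure and data are invented for the example.

```python
# Minimal sketch (assumptions: numpy, pandas, statsmodels): design-ignorant
# logistic GLM versus a random-intercept logistic mixed model on simulated
# two-level (clustered) data. Illustrative only; not the thesis analysis.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(1)
n_clusters, n_per = 50, 40
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0, 1.0, n_clusters)[cluster]      # level-2 (cluster) random effect
x1 = rng.normal(size=cluster.size)               # first-level predictor
x2 = rng.normal(size=n_clusters)[cluster]        # second-level predictor
eta = -0.5 + 0.8 * x1 + 0.5 * x2 + u
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "cluster": cluster})

# Ordinary logistic GLM treating observations as independent (ignores clustering).
glm = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit()
print(glm.summary())

# Logistic mixed model with a random intercept per cluster, fitted by variational Bayes.
mixed = BinomialBayesMixedGLM.from_formula(
    "y ~ x1 + x2", {"cluster": "0 + C(cluster)"}, df).fit_vb()
print(mixed.summary())
```

Comparing the two summaries shows how ignoring the cluster-level variation changes the fixed-effect estimates and understates their uncertainty, which is the central point of the model-based treatment above.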
The application of Classification Trees in the Banking Sector
- Authors: Mtwa, Sithayanda
- Date: 2021-04
- Subjects: To be added
- Language: English
- Type: thesis , text , Masters , MSc
- Identifier: http://hdl.handle.net/10962/178514 , vital:42946
- Description: Access restricted until April 2026. , Thesis (MSc) -- Faculty of Science, Statistics, 2021
- Full Text:
- Date Issued: 2021-04
Default in payment, an application of statistical learning techniques
- Authors: Gcakasi, Lulama
- Date: 2020
- Subjects: Credit -- South Africa -- Risk assessment , Risk management -- Statistical methods -- South Africa , Credit -- Management -- Statistical methods , Commercial statistics
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/141547 , vital:37984
- Description: The ability of a financial institution to detect whether a customer will default on their credit card payment is essential for its profitability. To that end, financial institutions have credit scoring systems in place to estimate the credit risk associated with a customer. Various classification models are used to develop credit scoring systems, such as k-nearest neighbours, logistic regression and classification trees. This study aims to assess the performance of different classification models on the prediction of credit card payment default. Credit data are usually high-dimensional, and as a result dimension reduction techniques, namely principal component analysis and linear discriminant analysis, are used in this study as a means to improve model performance. Two classification models are used, namely neural networks and support vector machines. Model performance is evaluated using accuracy and the area under the curve (AUC). The neural network classifier performed better than the support vector machine classifier, producing higher accuracy rates and AUC values. Dimension reduction techniques were not effective in improving model performance but did result in less computationally expensive models. (An illustrative model-comparison sketch follows this record.)
- Full Text:
- Date Issued: 2020
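A minimal sketch of the comparison described in the abstract above, assuming scikit-learn and synthetic data in place of the real credit data: a multilayer-perceptron neural network and a support vector machine are each fitted with and without PCA dimension reduction and scored by accuracy and AUC. Linear discriminant analysis could be substituted for PCA in the same pipeline.

```python
# Minimal sketch (not the thesis code): neural network and SVM, with and
# without PCA, on synthetic credit-style data, scored by accuracy and AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=5_000, n_features=30, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "MLP":       make_pipeline(StandardScaler(),
                               MLPClassifier(max_iter=500, random_state=0)),
    "MLP + PCA": make_pipeline(StandardScaler(), PCA(n_components=10),
                               MLPClassifier(max_iter=500, random_state=0)),
    "SVM":       make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "SVM + PCA": make_pipeline(StandardScaler(), PCA(n_components=10),
                               SVC(probability=True, random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    print(f"{name:10s} accuracy={accuracy_score(y_te, model.predict(X_te)):.3f} "
          f"AUC={roc_auc_score(y_te, prob):.3f}")
```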
Generalized linear models, with applications in fisheries research
- Authors: Sidumo, Bonelwa
- Date: 2018
- Subjects: Western mosquitofish , Analysis of variance , Fisheries -- Catch effort -- South Africa -- Sundays River (Eastern Cape) , Linear models (Statistics) , Multilevel models (Statistics) , Experimental design
- Language: English
- Type: text , Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10962/61102 , vital:27975
- Description: Gambusia affinis (G. affinis) is an invasive fish species found in the Sundays River Valley of the Eastern Cape, South Africa. The relative abundance and population dynamics of G. affinis were quantified in five interconnected impoundments within the Sundays River Valley. This study utilized a G. affinis data set to demonstrate various classical ANOVA models. Generalized linear models were used to standardize catch per unit effort (CPUE) estimates and to determine the environmental variables which influenced the CPUE. Based on the generalized linear model results, dam age, mean temperature, Oreochromis mossambicus abundance and Glossogobius callidus abundance had a significant effect on the G. affinis CPUE. The Albany Angling Association collected data during fishing tag-and-release events. These data were utilized to demonstrate repeated measures designs. Mixed-effects models provide a powerful and flexible tool for analyzing clustered data, such as repeated measures and nested data; hence they have become tremendously popular as a framework for the analysis of bio-behavioral experiments. The results show that the mixed-effects methods proposed in this study are more efficient than those based on generalized linear models. These data were better modeled with mixed-effects models due to their flexibility in handling missing data. (An illustrative CPUE standardization sketch follows this record.)
- Full Text:
- Date Issued: 2018
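The CPUE standardization step described in the abstract above can be sketched as follows, assuming statsmodels and entirely synthetic data; covariate names such as temperature and dam_age are placeholders, not the thesis variables. Catch counts are modelled with a Poisson GLM and log(effort) as an offset, so the exponentiated coefficients act multiplicatively on catch per unit effort.

```python
# Minimal sketch (synthetic data, illustrative covariates): standardizing
# catch per unit effort (CPUE) with a Poisson GLM and a log(effort) offset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "effort": rng.uniform(0.5, 3.0, n),       # e.g. net-nights per sampling event
    "temperature": rng.normal(20, 3, n),      # placeholder environmental covariate
    "dam_age": rng.integers(1, 30, n),        # placeholder impoundment age (years)
})
mu = np.exp(-1.0 + 0.05 * df.temperature + 0.02 * df.dam_age) * df.effort
df["catch"] = rng.poisson(mu)

# Poisson GLM: log E[catch] = log(effort) + b0 + b1*temperature + b2*dam_age
fit = smf.glm("catch ~ temperature + dam_age", data=df,
              family=sm.families.Poisson(),
              offset=np.log(df["effort"])).fit()
print(fit.summary())

# Exponentiated coefficients give multiplicative effects on CPUE
# (catch per one unit of effort), since log(effort) enters as an offset.
print(np.exp(fit.params))
```

A repeated-measures extension along the lines of the abstract would add a random effect for the sampling unit (for example via a mixed-effects model), which is what makes the mixed-effects approach attractive for the tag-and-release data.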
Prediction of protein secondary structure using binary classification trees, naive Bayes classifiers and the Logistic Regression Classifier
- Authors: Eldud Omer, Ahmed Abdelkarim
- Date: 2016
- Subjects: Bayesian statistical decision theory , Logistic regression analysis , Biostatistics , Proteins -- Structure
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:5581 , http://hdl.handle.net/10962/d1019985
- Description: The secondary structure of proteins is predicted using various binary classifiers. The data are taken from the RS126 database. The original data consist of protein primary and secondary structure sequences encoded using alphabetic letters; these are re-encoded into unary vectors comprising only ones and zeros. Different binary classifiers, namely naive Bayes, logistic regression and classification trees, are trained on the encoded data using hold-out and 5-fold cross-validation. For each of the classifiers three classification tasks are considered, namely helix against not helix (H/∼H), sheet against not sheet (S/∼S) and coil against not coil (C/∼C). The performance of these binary classifiers is compared using the overall accuracy in predicting the protein secondary structure for various window sizes. Our results indicate that the hold-out approach achieved higher accuracy than 5-fold cross-validation. The naive Bayes classifier, using 5-fold cross-validation, achieved the lowest accuracy for predicting helix against not helix. The classification tree classifiers, using 5-fold cross-validation, achieved the lowest accuracies for both coil against not coil and sheet against not sheet. The accuracy of the logistic regression classifier depends on the window size; there is a positive relationship between accuracy and window size. The logistic regression classifier achieved the highest accuracy compared to the classification tree and naive Bayes classifiers for each classification task: helix against not helix with 77.74 percent accuracy, sheet against not sheet with 81.22 percent and coil against not coil with 73.39 percent. It is noted that it would be easier to compare classifiers if the classification process could be carried out entirely in R. Alternatively, it would be easier to assess these logistic regression classifiers if SPSS had a function to determine the accuracy of the logistic regression classifier. (An illustrative encoding-and-classification sketch follows this record.)
- Full Text:
- Date Issued: 2016
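The encoding-and-classification workflow described in the abstract above can be sketched as follows, assuming scikit-learn and randomly generated residue windows rather than the RS126 data. Each fixed-width window is unary (one-hot) encoded into a 0/1 vector, and naive Bayes, logistic regression and a classification tree are compared with 5-fold cross-validation on a helix/not-helix task.

```python
# Minimal sketch (synthetic sequences, not the RS126 data): unary encoding of
# amino-acid windows, then comparing three binary classifiers with 5-fold CV.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 13                                # illustrative window size; the thesis varies this

def encode_window(window):
    """Unary (one-hot) encode a window of residues into a 0/1 vector."""
    vec = np.zeros(len(window) * len(AMINO_ACIDS))
    for i, aa in enumerate(window):
        vec[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec

# Synthetic stand-in: random residue windows with random helix/not-helix labels.
# With random labels the accuracies hover near 0.5; real labels come from the
# secondary structure sequences.
rng = np.random.default_rng(3)
windows = ["".join(rng.choice(list(AMINO_ACIDS), WINDOW)) for _ in range(2000)]
X = np.array([encode_window(w) for w in windows])
y = rng.integers(0, 2, len(windows))       # 1 = helix (H), 0 = not helix (~H)

for name, clf in [("naive Bayes", BernoulliNB()),
                  ("logistic regression", LogisticRegression(max_iter=1000)),
                  ("classification tree", DecisionTreeClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:20s} 5-fold CV accuracy: {acc:.3f}")
```

Hold-out evaluation, the other scheme named in the abstract, would replace cross_val_score with a single train_test_split of the encoded windows.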