Onur Yaman, M.Sc.
Department of Scientific Computing
February 2026
Supervisor: Ömür Uğur (Institute of Applied Mathematics, Middle East Technical University, Ankara)
Abstract
Datasets used in credit risk modelling and fraud detection are typically highly imbalanced, with defaults or fraudulent transactions constituting only a small fraction of all observations. Accurately identifying these minority events is crucial: missed defaults lead to underestimated expected credit losses, inaccurate capital requirements, and loan portfolio mispricing, while undetected fraud causes direct financial losses and operational risk. In such settings, classical classifiers, particularly gradient-boosted decision trees such as LightGBM and XGBoost, often struggle because the binary cross-entropy objective is dominated by the majority class.
This thesis proposes the Feature Aware Conditional Diffusion Oversampler, a diffusion-based oversampling framework tailored for severely imbalanced financial tabular data. The method builds on the Denoising Diffusion Probabilistic Model (DDPM) paradigm and extends it with task-oriented mechanisms to improve minority sample quality. Specifically, it conditions the denoising process on feature-importance information derived from SHapley Additive exPlanations (SHAP), injected via Feature-wise Linear Modulation (FiLM), which encourages generation toward class-consistent regions of the minority manifold. Classifier-free guidance further shapes the sampling trajectory, and a two-stage filtering strategy based on geometric proximity and probability-based consistency removes low-quality candidates and retains informative samples.
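As a rough illustration of the two conditioning mechanisms named above, the following minimal NumPy sketch shows a FiLM-style scale-and-shift and a classifier-free guidance combination step. It is not the thesis implementation: the function names, the weight matrices `W_gamma` and `W_beta`, and the guidance convention (where `w = 1` recovers the conditional estimate) are assumptions made for illustration only.

```python
import numpy as np

def film(h, cond, W_gamma, W_beta):
    """FiLM-style modulation (illustrative): scale and shift hidden
    features h with parameters predicted from a conditioning vector.
    Here cond stands in for a feature-importance embedding (assumption)."""
    gamma = cond @ W_gamma  # per-feature scale, shape (batch, hidden)
    beta = cond @ W_beta    # per-feature shift, shape (batch, hidden)
    return gamma * h + beta

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance (one common convention, assumed here):
    move from the unconditional noise estimate toward the conditional
    one; w = 0 gives the unconditional estimate, w = 1 the conditional."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy usage with hypothetical shapes: 2 samples, 4 conditioning dims, 3 hidden dims.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 3))
cond = np.abs(rng.normal(size=(2, 4)))   # stand-in for importance scores
W_gamma = rng.normal(size=(4, 3))
W_beta = rng.normal(size=(4, 3))
modulated = film(h, cond, W_gamma, W_beta)

eps_c = rng.normal(size=(2, 3))
eps_u = rng.normal(size=(2, 3))
guided = cfg_combine(eps_c, eps_u, w=2.0)  # w > 1 extrapolates past eps_c
```

In a full model the scale and shift would be produced by learned layers inside the denoising network, and the two noise estimates would come from the same network evaluated with and without the condition.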
Samples generated by the Feature Aware Conditional Diffusion Oversampler are used to augment the training data, and downstream LightGBM and XGBoost classification models are evaluated on two probability-of-default datasets and one credit card fraud dataset. Results show that the proposed method consistently improves minority-sensitive metrics such as Recall, F1-score, G-Mean, and Area Under the Precision-Recall Curve compared to widely used oversampling baselines, supporting more reliable detection of financially critical minority events.
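To make the notion of a minority-sensitive metric concrete, the sketch below computes G-Mean, the geometric mean of recall (sensitivity) and specificity, from binary labels. The helper name `g_mean` and the toy data are illustrative assumptions, not material from the thesis, and the function assumes both classes are present in `y_true`.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of recall and specificity for binary labels {0, 1}.
    Assumes y_true contains at least one positive and one negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    recall = tp / (tp + fn)        # fraction of minority events detected
    specificity = tn / (tn + fp)   # fraction of majority cases kept clean
    return float(np.sqrt(recall * specificity))

# Toy example: 99 majority cases, 1 minority case, classifier predicts all 0.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
score = g_mean(y_true, y_pred)
```

On this toy example the always-majority classifier reaches 99% accuracy while its G-Mean is zero, which is why metrics of this kind are preferred when the minority class is the one that matters.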
Keywords: Imbalanced Classification, Diffusion-Based Oversampling, Feature-Importance-Guided Generation, Credit Risk Modelling, Fraud Detection