Siamese Network for Microfinance Credit Scoring under Low-Data Regimes

This project explores how Siamese networks and representation learning can improve credit scoring in microfinance under low-data and highly imbalanced conditions. Traditional ensemble methods like Random Forest and CatBoost perform well on balanced datasets but degrade sharply when defaults are rare, a common challenge in real-world microfinance.

Using a subset of the Kiva crowdfunding dataset (1,000 loans), the study benchmarks Random Forest, CatBoost, Gradient Boosting, Gaussian Process Classifier, and a Siamese network framework across both balanced (50:50) and imbalanced (95:5) splits. While Random Forest achieved the strongest results in balanced settings (ROC AUC = 0.90, PR AUC = 0.90), the Siamese network demonstrated greater robustness under imbalance, maintaining competitive precision-recall performance (PR AUC = 0.81).
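As an illustration of the evaluation setup described above, the following is a minimal sketch of how a subsample with a target class ratio (50:50 or 95:5) might be drawn and how PR AUC could be estimated as average precision. The function names, the convention that label 1 marks a default, and the tie handling (ties are ranked in input order) are assumptions made for this sketch, not details taken from the study.

```python
import random


def subsample_to_ratio(labels, pos_fraction, n_total, seed=0):
    """Return shuffled indices of a subsample with the requested share of positives.

    Assumption: label 1 marks a defaulted loan, label 0 a repaid one.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n_total * pos_fraction)
    n_neg = n_total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough samples of one class for the requested split")
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


def average_precision(y_true, scores):
    """Average precision: a step-wise estimate of the area under the PR curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank loans from highest to lowest predicted default score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    # Toy labels only; these are not Kiva data.
    toy_labels = [1] * 100 + [0] * 900
    imbalanced = subsample_to_ratio(toy_labels, pos_fraction=0.05, n_total=200)
    print(sum(toy_labels[i] for i in imbalanced), "defaults out of", len(imbalanced))
```

Under a 95:5 split, a random scorer has an expected average precision near 0.05, which is why PR AUC is often more informative than ROC AUC when defaults are rare.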

Ablation studies further showed that embedding choice, calibration methods, and downstream classifiers significantly impact performance. The best-performing variant combined Siamese embeddings with Logistic Regression and isotonic calibration, balancing discrimination and probability reliability.
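To make the calibration step concrete, here is a minimal sketch of isotonic calibration using the pool-adjacent-violators procedure. It assumes raw default scores (for example, Logistic Regression outputs computed on Siamese embeddings) and binary outcomes from a held-out calibration set. The function names are hypothetical, and the step-function lookup is a simplification; library implementations typically interpolate between fitted points.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed default rates."""
    # Each block stores [sum of labels, number of samples, largest score in block].
    blocks = []
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Identical scores must share one calibrated value, so pool them.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while an earlier block has a higher mean than the later one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score, uppers, values):
    """Map a raw score to a calibrated probability via the fitted step function."""
    k = bisect_left(uppers, score)
    # Scores above the largest fitted score reuse the last block's value.
    return values[min(k, len(values) - 1)]


if __name__ == "__main__":
    # Toy scores and outcomes, purely illustrative.
    raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    outcome = [0, 1, 0, 0, 1, 1]
    uppers, values = fit_isotonic(raw, outcome)
    print([round(calibrate(s, uppers, values), 3) for s in raw])
```

Because the fitted mapping is monotone, it leaves the ranking of borrowers essentially unchanged (apart from introducing ties) and mainly affects how trustworthy the predicted probabilities are.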

The findings highlight the promise of representation learning for inclusive credit scoring in resource-constrained environments. Future work will explore improved calibration, temporal borrower modeling, and multimodal data integration.

Vladislav Bogomazov

BSc in Economics, Management and Computer Science at Bocconi University

Marco Lomele*

MSc in Data Science and Business Analytics at Bocconi University