Siamese Network for Microfinance Credit Scoring under Low-Data Regimes
This project explores how Siamese networks and representation learning can improve credit scoring in microfinance under low-data and highly imbalanced conditions. Traditional ensemble methods like Random Forest and CatBoost perform well on balanced datasets but degrade sharply when defaults are rare, which is the common case in real-world microfinance.
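To make the pairwise idea concrete, the following is a minimal sketch of how a Siamese setup scores pairs of loan embeddings. The project's actual loss function, embedding size, and pair-sampling strategy are not stated here, so the contrastive loss, the margin of 1.0, and the names `contrastive_loss` and `make_pairs` are illustrative assumptions rather than the study's implementation.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Assumed loss: pull same-class pairs together, push others past the margin."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def make_pairs(labels):
    """Enumerate every unordered pair of loans with a same-class flag."""
    return [(i, j, labels[i] == labels[j])
            for i, j in combinations(range(len(labels)), 2)]


# Hypothetical 2-D embeddings for four loans (0 = repaid, 1 = defaulted).
embeddings = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.5, 0.0]]
labels = [0, 0, 1, 1]

pairs = make_pairs(labels)
losses = [contrastive_loss(embeddings[i], embeddings[j], same)
          for i, j, same in pairs]
mean_loss = sum(losses) / len(losses)
print(f"{len(pairs)} pairs from {len(labels)} loans, mean loss = {mean_loss:.3f}")
```

Because n loans yield n(n-1)/2 pairs, even a handful of defaults takes part in many training comparisons, which may be one reason pair-based training suits small datasets.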
Using a subset of the Kiva crowdfunding dataset (1,000 loans), the study benchmarks Random Forest, CatBoost, Gradient Boosting, Gaussian Process Classifier, and a Siamese network framework across both balanced (50:50) and imbalanced (95:5) splits. While Random Forest achieved the strongest results in balanced settings (ROC AUC = 0.90, PR AUC = 0.90), the Siamese network demonstrated greater robustness under imbalance, maintaining competitive precision-recall performance (PR AUC = 0.81).
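As a rough illustration of why both ROC AUC and PR AUC are reported, the sketch below computes the two metrics on a balanced sample and on a 95:5 subsample of the same scores. It uses synthetic Gaussian scores, not the Kiva data, and assumes defaults are the rare positive class; the helper names (`roc_auc`, `average_precision`, `subsample_to_ratio`) are hypothetical, and average precision stands in for PR AUC.

```python
import random


def roc_auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits if hits else 0.0


def subsample_to_ratio(y_true, scores, pos_fraction, seed=0):
    """Keep all negatives; keep a random subset of positives to hit pos_fraction."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    n_pos = max(1, round(len(neg_idx) * pos_fraction / (1 - pos_fraction)))
    keep = sorted(rng.sample(pos_idx, min(n_pos, len(pos_idx))) + neg_idx)
    return [y_true[i] for i in keep], [scores[i] for i in keep]


# Synthetic scores: defaults (label 1) score higher on average than repaid loans.
rng = random.Random(42)
y_bal = [1] * 500 + [0] * 500
s_bal = [rng.gauss(1.0, 1.0) if y == 1 else rng.gauss(0.0, 1.0) for y in y_bal]
y_imb, s_imb = subsample_to_ratio(y_bal, s_bal, pos_fraction=0.05)

for name, y, s in [("50:50", y_bal, s_bal), ("95:5", y_imb, s_imb)]:
    print(name, round(roc_auc(y, s), 3), round(average_precision(y, s), 3))
```

ROC AUC depends only on how positives rank against negatives, so it is largely insensitive to the class ratio, whereas the PR AUC of a random scorer is roughly the positive prevalence. That makes PR AUC the more informative number when defaults are rare.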
Ablation studies further showed that embedding choice, calibration methods, and downstream classifiers significantly impact performance. The best-performing variant combined Siamese embeddings with Logistic Regression and isotonic calibration, balancing discrimination and probability reliability.
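The calibration step can be sketched as follows. This is a minimal pure-Python version of isotonic calibration using the pool-adjacent-violators algorithm, assuming the raw scores come from a downstream classifier such as Logistic Regression fitted on Siamese embeddings. The scores and labels are toy values, the function names are hypothetical, and the study's own calibration code may differ (for example, library implementations usually interpolate between steps rather than returning a step function).

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: non-decreasing step map from score to default rate.

    Assumes scores are distinct; tied scores would need to be pooled first.
    """
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def predict_isotonic(model, score):
    """Return the calibrated probability of the block that covers the score."""
    thresholds, values = model
    idx = bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]


# Toy uncalibrated scores and observed outcomes (1 = default).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]

model = fit_isotonic(raw_scores, outcomes)
for s in (0.05, 0.25, 0.9):
    print(s, round(predict_isotonic(model, s), 3))
```

In practice the calibrator is typically fitted on a held-out split rather than on the classifier's training data, since isotonic regression can overfit when few defaults are available.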
The findings highlight the promise of representation learning for inclusive credit scoring in resource-constrained environments. Future work will explore improved calibration, temporal borrower modeling, and multimodal data integration.