ML Loan Default Predictor

A machine learning pipeline that predicts loan defaults using XGBoost and scikit-learn. Includes data preprocessing, feature engineering, and a Streamlit dashboard.

June 10, 2024

Python, XGBoost, scikit-learn, Streamlit, Pandas

Overview

A complete machine learning solution for predicting loan default risk, built as a capstone project combining data science and web development skills. The model processes historical loan data and outputs risk scores that financial institutions can use to make informed lending decisions.

Key Features

Data Pipeline: Automated data cleaning, imputation, and feature engineering
Model Training: XGBoost classifier with hyperparameter tuning via Optuna
Interactive Dashboard: Streamlit app for exploring predictions and model performance
API Endpoint: FastAPI service for real-time predictions
Explainability: SHAP values for model interpretation

Technical Details

The dataset contains 50,000+ historical loan records with 30+ features. After feature engineering (including credit utilization ratios, payment history aggregates, and debt-to-income calculations), the XGBoost model achieves an AUC-ROC of 0.92 on the holdout test set.
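
The engineered features named above can be illustrated with a minimal pandas sketch. Every column name here (revolving_balance, credit_limit, monthly_debt_payment, monthly_income, num_payments, num_late_payments) and the helper add_ratio_features are hypothetical placeholders, not the project's actual schema or code, and the three rows are made-up values for illustration only.

```python
import numpy as np
import pandas as pd


def add_ratio_features(loans: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `loans` with three engineered ratio features.

    All column names are hypothetical placeholders, not the project's schema.
    """
    out = loans.copy()

    # Credit utilization: share of available revolving credit in use.
    # A zero limit becomes NaN so the ratio is left for the imputer.
    limit = out["credit_limit"].replace(0, np.nan)
    out["credit_utilization"] = out["revolving_balance"] / limit

    # Debt-to-income: monthly debt payments over monthly gross income.
    income = out["monthly_income"].replace(0, np.nan)
    out["debt_to_income"] = out["monthly_debt_payment"] / income

    # Payment history aggregate: fraction of past payments that were late.
    total = out["num_payments"].replace(0, np.nan)
    out["late_payment_rate"] = out["num_late_payments"] / total

    return out


# Made-up rows for illustration only.
loans = pd.DataFrame(
    {
        "revolving_balance": [2500.0, 0.0, 900.0],
        "credit_limit": [10000.0, 5000.0, 0.0],
        "monthly_debt_payment": [800.0, 300.0, 1200.0],
        "monthly_income": [4000.0, 3000.0, 4800.0],
        "num_payments": [24, 12, 36],
        "num_late_payments": [3, 0, 9],
    }
)
features = add_ratio_features(loans)
```

Mapping zero denominators to NaN rather than a sentinel value keeps undefined ratios visible, so a later imputation step can handle them instead of the model seeing an arbitrary number.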

Model Pipeline

Data ingestion and validation with Pandas
Missing value imputation using KNN imputer
Feature encoding with target encoding for high-cardinality categoricals
Hyperparameter optimization with Optuna (200 trials)
Model evaluation with cross-validation and calibration curves
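
The imputation and evaluation steps above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the project's code: the data is synthetic, scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-light, and the target encoding and Optuna tuning steps are omitted.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the loan data (about 20% positives).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, weights=[0.8, 0.2], random_state=0
)

# Knock out roughly 5% of the values at random to mimic missing fields.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation lives inside the pipeline so each CV fold fits the imputer on
# its own training split only, avoiding leakage from validation rows.
model = Pipeline(
    [
        ("impute", KNNImputer(n_neighbors=5)),
        # Stand-in for the XGBoost classifier described in the text.
        ("clf", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

# Calibration check on the holdout split: compare the mean predicted
# probability with the observed positive rate in each probability bin.
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
frac_positive, mean_predicted = calibration_curve(y_test, proba, n_bins=5)
```

Plotting frac_positive against mean_predicted gives the calibration curve; points close to the diagonal mean the predicted probabilities can be read as default rates.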

Lessons Learned

Working with imbalanced datasets required careful use of SMOTE and threshold tuning. I also learned that feature engineering has far more impact than model selection for tabular data.
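
Threshold tuning can be sketched in plain Python. The example below assumes F1 as the selection metric, which is one reasonable choice and not necessarily the one used in this project; the labels and scores are a made-up toy validation set, and the helper names f1_at_threshold and best_threshold are hypothetical. SMOTE is not shown.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when scores at or above `threshold` are flagged as defaults."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob):
    """Try each distinct score as a cutoff; return (threshold, f1) with the
    highest F1. Ties go to the higher threshold."""
    candidates = sorted(set(y_prob))
    scored = [(f1_at_threshold(y_true, y_prob, t), t) for t in candidates]
    best_f1, best_t = max(scored)
    return best_t, best_f1


# Toy validation set: 2 defaults out of 10 loans, with made-up model scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.70]

threshold, f1 = best_threshold(y_true, y_prob)
default_f1 = f1_at_threshold(y_true, y_prob, 0.5)
```

On this toy data the default 0.5 cutoff catches only one of the two defaults, while lowering the cutoff to 0.40 catches both at the cost of one false positive, which raises F1. The threshold should be chosen on a validation split and not on the holdout test set.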