ML Loan Default Predictor
A machine learning pipeline that predicts loan defaults using XGBoost and scikit-learn. Includes data preprocessing, feature engineering, and a Streamlit dashboard.
Overview
A complete machine learning solution for predicting loan default risk, built as a capstone project combining data science and web development skills. The model processes historical loan data and outputs risk scores that financial institutions can use to make informed lending decisions.
Key Features
- Data Pipeline: Automated data cleaning, imputation, and feature engineering
- Model Training: XGBoost classifier with hyperparameter tuning via Optuna
- Interactive Dashboard: Streamlit app for exploring predictions and model performance
- API Endpoint: FastAPI service for real-time predictions
- Explainability: SHAP values for model interpretation
Technical Details
The dataset contains 50,000+ historical loan records with 30+ features. After feature engineering (including credit utilization ratios, payment history aggregates, and debt-to-income calculations), the XGBoost model achieves an AUC-ROC of 0.92 on the holdout test set.
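As a rough illustration of the ratio features mentioned above, the sketch below derives a credit utilization ratio and a debt-to-income ratio with Pandas. The column names (`revolving_balance`, `credit_limit`, `monthly_debt_payment`, `monthly_income`) and the helper `add_ratio_features` are hypothetical and chosen only for this example; the project's actual schema and feature definitions may differ.

```python
import numpy as np
import pandas as pd


def add_ratio_features(loans: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `loans` with two ratio features added.

    All column names here are assumptions made for illustration.
    """
    out = loans.copy()
    # Treat zero denominators as missing so the ratio becomes NaN, not inf.
    limit = out["credit_limit"].replace(0, np.nan)
    income = out["monthly_income"].replace(0, np.nan)
    out["credit_utilization"] = out["revolving_balance"] / limit
    out["debt_to_income"] = out["monthly_debt_payment"] / income
    return out


# Tiny made-up sample, not real loan data.
sample = pd.DataFrame({
    "revolving_balance": [2500.0, 0.0, 900.0],
    "credit_limit": [10000.0, 5000.0, 0.0],
    "monthly_debt_payment": [1200.0, 300.0, 450.0],
    "monthly_income": [4000.0, 3000.0, 1500.0],
})
features = add_ratio_features(sample)
print(features[["credit_utilization", "debt_to_income"]])
```

Leaving the zero-limit row as NaN rather than infinity means it can be handled by the same imputation step as any other missing value.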
Model Pipeline
- Data ingestion and validation with Pandas
- Missing value imputation using KNN imputer
- Feature encoding with target encoding for high-cardinality categoricals
- Hyperparameter optimization with Optuna (200 trials)
- Model evaluation with cross-validation and calibration curves
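The imputation and cross-validation steps above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the project's actual code: it runs on synthetic data from `make_classification`, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it depends on scikit-learn alone. Because `xgboost.XGBClassifier` follows the scikit-learn estimator interface, it should fit into the same `"model"` slot. Target encoding, Optuna tuning, and calibration curves are left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in data (about 10% positives); not the loan dataset.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)

# Blank out roughly 5% of the values so the imputer has something to fill.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Keeping the imputer inside the pipeline means it is refit on each
# training fold, so validation rows never influence the imputed values.
pipeline = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("model", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print("AUC-ROC per fold:", np.round(auc_scores, 3))
```

Stratified folds keep the share of defaults roughly constant across splits, which matters when the positive class is rare.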
Lessons Learned
Working with imbalanced datasets required careful use of SMOTE and threshold tuning. A second takeaway: feature engineering has far more impact than model selection for tabular data.
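To make the threshold tuning point concrete, here is a small sketch that sweeps candidate cutoffs over predicted probabilities and keeps the one with the highest F1 score. The helper `best_f1_threshold`, the threshold grid, and the choice of F1 as the criterion are all assumptions for illustration; the project may have tuned against a different metric such as recall at a fixed precision or a cost function.

```python
import numpy as np


def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, f1) for the cutoff that maximizes F1.

    Hypothetical helper: F1 is one possible criterion among several.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest tied threshold
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Made-up labels and scores: two defaults, both scored below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_prob = [0.02, 0.06, 0.11, 0.13, 0.21, 0.23, 0.31, 0.37, 0.41, 0.47]
threshold, f1 = best_f1_threshold(y_true, y_prob)
print(f"best threshold ~ {threshold:.2f}, F1 = {f1:.2f}")
```

In this toy sample the default cutoff of 0.5 would flag no defaults at all, while the sweep settles on a cutoff of about 0.35 with an F1 of 0.8. In practice the threshold should be chosen on validation data, not on the holdout test set.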