ML Loan Default Predictor

A machine learning pipeline that predicts loan defaults using XGBoost and scikit-learn. Includes data preprocessing, feature engineering, and a Streamlit dashboard.

June 10, 2024
Python, XGBoost, scikit-learn, Streamlit, Pandas

Overview

A complete machine learning solution for predicting loan default risk, built as a capstone project combining data science and web development skills. The model processes historical loan data and outputs risk scores that financial institutions can use to make informed lending decisions.

Key Features

  • Data Pipeline: Automated data cleaning, imputation, and feature engineering
  • Model Training: XGBoost classifier with hyperparameter tuning via Optuna
  • Interactive Dashboard: Streamlit app for exploring predictions and model performance
  • API Endpoint: FastAPI service for real-time predictions
  • Explainability: SHAP values for model interpretation

Technical Details

The dataset contains 50,000+ historical loan records with 30+ features. After feature engineering (including credit utilization ratios, payment history aggregates, and debt-to-income calculations), the XGBoost model achieves an AUC-ROC of 0.92 on the holdout test set.
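
As a rough illustration of two of the engineered features named above, the sketch below derives a credit utilization ratio and a debt-to-income ratio from a single loan record. The field names (`revolving_balance`, `credit_limit`, `monthly_debt`, `annual_income`) and the handling of non-positive denominators are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LoanRecord:
    # Hypothetical field names; the real dataset's columns may differ.
    revolving_balance: float
    credit_limit: float
    monthly_debt: float
    annual_income: float


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is not positive."""
    if denominator <= 0:
        # Leave the value missing so a later imputation step can handle it.
        return None
    return numerator / denominator


def engineer_features(rec: LoanRecord) -> Dict[str, Optional[float]]:
    """Derive ratio features from raw loan fields."""
    monthly_income = rec.annual_income / 12
    return {
        "credit_utilization": safe_ratio(rec.revolving_balance, rec.credit_limit),
        "debt_to_income": safe_ratio(rec.monthly_debt, monthly_income),
    }


example = LoanRecord(
    revolving_balance=2500.0,
    credit_limit=10000.0,
    monthly_debt=1500.0,
    annual_income=60000.0,
)
features = engineer_features(example)
print(features)
```

Returning `None` instead of raising keeps records with a zero credit limit or zero income in the dataset, on the assumption that a downstream imputer fills the gap.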

Model Pipeline

  1. Data ingestion and validation with Pandas
  2. Missing value imputation using KNN imputer
  3. Feature encoding with target encoding for high-cardinality categoricals
  4. Hyperparameter optimization with Optuna (200 trials)
  5. Model evaluation with cross-validation and calibration curves
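
Step 3 can be sketched as smoothed target encoding, where each category is replaced by a blend of its own default rate and the global default rate. The function name, the smoothing weight `m`, and the toy data are assumptions for illustration; the project may well use a library encoder instead. In practice the mapping would be fit on training folds only, to avoid leaking the target into validation data.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def fit_target_encoding(
    categories: Sequence[Hashable],
    targets: Sequence[int],
    m: float = 10.0,
) -> Dict[Hashable, float]:
    """Map each category to a smoothed mean of the binary target.

    encoded = (category_sum + m * global_mean) / (category_count + m)
    Rare categories are pulled toward the global mean; frequent ones
    stay close to their own observed default rate.
    """
    if len(categories) != len(targets) or len(targets) == 0:
        raise ValueError("categories and targets must be non-empty and equal in length")

    global_mean = sum(targets) / len(targets)
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    return {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in counts
    }


# Toy data: "A" appears three times (two defaults), "B" once (no default).
cats = ["A", "A", "A", "B"]
defaults = [1, 1, 0, 0]
encoding = fit_target_encoding(cats, defaults, m=2.0)
print(encoding)
```

A larger `m` shrinks every category harder toward the global rate, which matters most for high-cardinality columns where many categories have only a handful of rows.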

Lessons Learned

Working with imbalanced datasets required careful use of SMOTE and threshold tuning. I also learned that feature engineering has far more impact than model selection for tabular data.
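
To make the threshold tuning mentioned above concrete, here is a minimal sketch that scans candidate cutoffs and keeps the one with the highest F1 score. Using F1 as the objective, the candidate grid, and the toy scores are all assumptions; the project's actual criterion (for example, a cost-weighted one) is not stated. SMOTE itself is not shown here.

```python
from typing import Optional, Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """F1 score when predicting default for every probability >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_default = p >= threshold
        if predicted_default and y == 1:
            tp += 1
        elif predicted_default and y == 0:
            fp += 1
        elif not predicted_default and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    candidates: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1.

    Ties are resolved in favor of the earliest candidate.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    scored = ((t, f1_at_threshold(y_true, y_prob, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Toy validation scores: defaults (label 1) are the minority class.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, f1 = best_threshold(y_true, y_prob)
print(f"default cutoff 0.50: F1={f1_at_threshold(y_true, y_prob, 0.5):.3f}")
print(f"tuned cutoff {threshold:.2f}: F1={f1:.3f}")
```

On this toy data the tuned cutoff falls below 0.5, which is the usual direction when positives are rare. The threshold should be chosen on validation data, not the holdout test set.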