# Weather Data Pipeline
An automated ETL pipeline that collects weather data from multiple APIs, processes it with Python, and generates daily forecast reports.
## Overview
This pipeline aggregates weather data from multiple public APIs, stores it in a PostgreSQL database, and generates daily summary reports. It was built to practice data engineering fundamentals with real-world data sources.
## Key Features
- **Multi-Source Ingestion**: Pulls data from OpenWeatherMap, WeatherAPI, and government meteorological services
- **Scheduled Runs**: Apache Airflow DAGs for hourly data collection and daily report generation
- **Data Quality Checks**: Automated validation for missing values, outliers, and source discrepancies
- **Reporting**: Daily PDF reports with temperature trends, precipitation, and forecast accuracy
- **Containerized**: Fully dockerized for easy deployment
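The data-quality stage can be sketched as a pure function over one hour's readings. The function name, thresholds, and source keys below are illustrative assumptions, not the project's actual code:

```python
from statistics import mean, stdev

def validate_readings(readings, z_threshold=3.0, max_source_spread=2.0):
    """Flag missing values, outliers, and cross-source discrepancies.

    `readings` maps source name -> temperature in Celsius (None = missing).
    Returns a list of issue strings; an empty list means the batch passed.
    """
    issues = []

    # Missing-value check: every configured source should report a reading.
    for src, value in readings.items():
        if value is None:
            issues.append(f"missing reading from {src}")

    values = [v for v in readings.values() if v is not None]
    if len(values) >= 2:
        # Outlier check: z-score of each reading against the batch.
        mu, sigma = mean(values), stdev(values)
        if sigma > 0:
            for src, value in readings.items():
                if value is not None and abs(value - mu) / sigma > z_threshold:
                    issues.append(f"outlier from {src}: {value}")

        # Discrepancy check: independent sources should roughly agree.
        spread = max(values) - min(values)
        if spread > max_source_spread:
            issues.append(f"source spread {spread:.1f} C exceeds limit")

    return issues
```

Failed checks are collected rather than raised, so a single bad reading can be flagged in the report without aborting the whole hourly run.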
## Technical Details
The pipeline is orchestrated with Apache Airflow running in Docker containers. Python scripts handle API calls, data transformation, and report generation. PostgreSQL stores both raw and processed data with time-series partitioning.
### Pipeline Architecture
1. **Collect**: Airflow tasks fetch data from 3 weather APIs every hour
2. **Validate**: Data quality checks flag anomalies and missing readings
3. **Transform**: Normalize units, interpolate gaps, compute daily aggregates
4. **Store**: Insert into partitioned PostgreSQL tables
5. **Report**: Generate daily PDF summaries with matplotlib charts
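The transform stage above (normalize units, interpolate gaps, compute daily aggregates) can be sketched as below; the function names and rounding choices are illustrative assumptions:

```python
def fahrenheit_to_celsius(f):
    """Normalize a Fahrenheit reading to Celsius."""
    return (f - 32.0) * 5.0 / 9.0

def interpolate_gaps(series):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

def daily_aggregates(hourly_celsius):
    """Reduce a day's hourly readings to the stored summary values."""
    return {
        "min": min(hourly_celsius),
        "max": max(hourly_celsius),
        "mean": round(sum(hourly_celsius) / len(hourly_celsius), 2),
    }
```

Keeping each step a pure function over plain sequences makes the transforms easy to unit-test outside Airflow.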
## Lessons Learned
This project reinforced the importance of idempotent pipeline tasks and of handling API rate limits and downtime gracefully. Reconciling time zones across the different data sources was a recurring challenge.
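A minimal sketch of the two patterns mentioned above: exponential backoff around an API call, and an idempotent upsert so re-running a task cannot duplicate rows. The helper signature, table, and column names are assumptions, not the project's actual schema:

```python
import time

def with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `fetch`, retrying on failure with exponentially growing delays.

    Keeps a rate-limited or briefly unavailable API from failing the whole
    DAG run; the final attempt re-raises so Airflow still marks the task failed.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Idempotent write: re-running an hour's task overwrites instead of duplicating,
# given a unique constraint on (source, observed_at).
UPSERT_SQL = """
INSERT INTO raw_readings (source, observed_at, temperature_c)
VALUES (%s, %s, %s)
ON CONFLICT (source, observed_at) DO UPDATE
SET temperature_c = EXCLUDED.temperature_c;
"""
```

Injecting `sleep` as a parameter keeps the backoff logic testable without real delays.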