Enterprise consulting engagement · Software Developer

High-Throughput Data Pipeline Platform

Designed and built a data pipeline platform processing over 10 million records daily for a client in the financial services sector, replacing a fragile batch ETL process that regularly missed SLAs.

  • Python
  • PostgreSQL
  • SQLAlchemy
  • Docker
  • Redis
  • SQL

The Problem

The client ran a nightly batch ETL job that ingested transaction data from
several upstream systems, transformed it, and loaded it into a reporting
warehouse. The job had grown organically over years and was:

  • Missing its four-hour SLA window on roughly 20% of nights
  • Nearly impossible to debug: a single 4,000-line Python script with no
    tests
  • Completely sequential; one failed upstream feed would block the entire run

Constraints

  • Could not change the upstream data sources or their formats
  • Warehouse schema was owned by a separate team and changed infrequently
  • Target throughput: 10 M+ records per day with a two-hour completion window
  • Compliance requirement: full audit trail for every record transformation

My Approach

I redesigned the pipeline around three principles:

  1. Parallelism by default: each upstream source became an independent
    pipeline stage, fanned out with Python’s concurrent.futures. A failed
    feed no longer blocked others.

  2. Idempotent, restartable stages: each transformation step wrote to a
    staging table keyed on a content hash. Re-running a failed stage was safe
    and cheap.

  3. Observability first: structured logging with a correlation ID per
    pipeline run, plus a lightweight status table queryable by the ops team.
    Mean time to diagnose a failure dropped from hours to minutes.
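The fan-out described in principle 1 can be sketched with concurrent.futures. This is a minimal illustration, not the production code: the feed names and the `ingest` function are hypothetical stand-ins for the real per-source extraction stages.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-source ingestion step; in the real pipeline each
# callable wrapped extraction and transformation for one upstream feed.
def ingest(source: str) -> str:
    if source == "feed_c":
        raise RuntimeError("upstream feed unavailable")
    return f"{source}: ok"

sources = ["feed_a", "feed_b", "feed_c"]
results, failures = {}, {}

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(ingest, s): s for s in sources}
    for fut in as_completed(futures):
        src = futures[fut]
        try:
            results[src] = fut.result()
        except Exception as exc:
            # A failed feed is recorded but does not block the others.
            failures[src] = str(exc)
```

The key property is that each future fails independently: `feed_c` lands in `failures` for retry while the other feeds complete normally.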
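The idempotency scheme in principle 2 amounts to keying staging writes on a content hash. A sketch of the idea, using an in-memory SQLite table in place of the client's PostgreSQL staging tables (table and column names are illustrative):

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staging (
        content_hash TEXT PRIMARY KEY,
        payload      TEXT NOT NULL
    )
""")

def stage_record(record: dict) -> None:
    # Canonical serialisation so identical records hash identically.
    payload = json.dumps(record, sort_keys=True)
    h = hashlib.sha256(payload.encode()).hexdigest()
    # INSERT OR IGNORE makes the write idempotent: a restarted stage
    # simply skips rows it has already staged.
    conn.execute(
        "INSERT OR IGNORE INTO staging (content_hash, payload) VALUES (?, ?)",
        (h, payload),
    )

records = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
for r in records + records:  # simulate a full re-run of the same stage
    stage_record(r)

count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
```

Re-running the stage over the same input leaves the table unchanged, which is what made failed stages safe and cheap to retry. In PostgreSQL the equivalent is `INSERT ... ON CONFLICT DO NOTHING`.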
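Principle 3, structured logging with a per-run correlation ID, can be sketched with the standard library. The JSON shape and field names here are assumptions for illustration, not the production log schema:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so logs are machine-queryable."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "run_id": getattr(record, "run_id", None),
            "stage": getattr(record, "stage", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One correlation ID per pipeline run, attached to every log line via
# `extra`, so the ops team can filter an entire run with a single ID.
run_id = str(uuid.uuid4())
for stage in ("extract", "transform", "load"):
    logger.info("stage complete", extra={"run_id": run_id, "stage": stage})
```

Grouping by `run_id` (in logs or in the status table) is what turns "which stage of which run failed?" into a single query, which is where the drop in diagnosis time came from.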

Results

  • SLA compliance improved from ~80% to 99.6% over the first quarter after
    launch
  • End-to-end runtime for a full daily load fell from an average of six hours
    to under ninety minutes
  • The audit trail satisfied a subsequent compliance review with no
    remediation required
  • Two junior developers maintained and extended the platform without needing
    support from me after a two-week handover

What I’d Do Differently

The parallelism model worked well but was built on Python threads and
processes rather than a dedicated task queue. For the volume at the time this
was fine, but as data volumes grew the team eventually had to migrate to a
proper orchestration tool. I would introduce something like Celery or a
lightweight workflow engine from the start to avoid that migration cost.


This demonstrates

  • Data engineering
  • Performance optimisation
  • Observability
  • Refactoring legacy systems
  • Parallel processing