High-Throughput Data Pipeline Platform
Designed and built a data pipeline platform processing over 10 million records daily for a client in the financial services sector, replacing a fragile batch ETL process that regularly missed SLAs.
The Problem
The client ran a nightly batch ETL job that ingested transaction data from
several upstream systems, transformed it, and loaded it into a reporting
warehouse. The job had grown organically over years and was:
- Missing its four-hour SLA window on roughly 20% of nights
- Nearly impossible to debug: a single 4,000-line Python script with no tests
- Completely sequential; one failed upstream feed would block the entire run
Constraints
- Could not change the upstream data sources or their formats
- Warehouse schema was owned by a separate team and changed infrequently
- Target throughput: 10 M+ records per day with a two-hour completion window
- Compliance requirement: full audit trail for every record transformation
My Approach
I redesigned the pipeline around three principles:
- Parallelism by default: each upstream source became an independent
  pipeline stage, fanned out with Python's concurrent.futures. A failed
  feed no longer blocked the others.
- Idempotent, restartable stages: each transformation step wrote to a
  staging table keyed on a content hash. Re-running a failed stage was safe
  and cheap.
- Observability first: structured logging with a correlation ID per
  pipeline run, plus a lightweight status table queryable by the ops team.
  Mean time to diagnose a failure dropped from hours to minutes.
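The fan-out pattern behind the first principle can be sketched roughly as follows. The feed names and the ingest function are illustrative stand-ins, not the client's actual code; the point is that each feed runs as its own future and a failure is captured per feed rather than aborting the run:

```python
import concurrent.futures

def ingest_feed(feed):
    """Illustrative stand-in for pulling and transforming one upstream feed."""
    if feed == "feed_b":  # simulate one broken upstream source
        raise RuntimeError(f"{feed} unavailable")
    return f"{feed}: loaded"

def run_pipeline(feeds):
    """Fan out one stage per feed; a failure in one feed does not block the rest."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(ingest_feed, f): f for f in feeds}
        for fut in concurrent.futures.as_completed(futures):
            feed = futures[fut]
            try:
                results[feed] = fut.result()
            except Exception as exc:  # isolate the failure, keep going
                errors[feed] = str(exc)
    return results, errors

results, errors = run_pipeline(["feed_a", "feed_b", "feed_c"])
```

Collecting errors alongside results is what lets the run complete and report exactly which feeds need a retry.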
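The idempotency scheme can be illustrated with an in-memory "staging table" keyed on a SHA-256 content hash. The real pipeline wrote to a database table, and the record shape here is invented; what carries over is that re-running a stage over records it has already seen is a no-op:

```python
import hashlib
import json

staging = {}  # stands in for the staging table: content_hash -> transformed row

def content_hash(record):
    # Canonical JSON so the same logical record always hashes identically
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def run_stage(records, transform):
    """Write each transformed record keyed by hash; re-runs are safe and cheap."""
    written = 0
    for rec in records:
        key = content_hash(rec)
        if key not in staging:  # already processed on a prior run
            staging[key] = transform(rec)
            written += 1
    return written

rows = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
first_run = run_stage(rows, lambda r: {**r, "amt_cents": r["amt"] * 100})
rerun = run_stage(rows, lambda r: {**r, "amt_cents": r["amt"] * 100})
```

The same keyed writes double as the audit trail: every transformed record is traceable back to the exact input content that produced it.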
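The observability idea, one correlation ID stamped on every log line of a run, might look like the sketch below. The event and field names are illustrative; the real pipeline also wrote to a status table, which is omitted here:

```python
import io
import json
import logging
import uuid

def make_run_logger(stream):
    """Return a run ID and a log function that emits one JSON object per line."""
    run_id = uuid.uuid4().hex
    logger = logging.getLogger(f"pipeline.{run_id}")  # unique name per run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def log(event, **fields):
        # Every line carries the run's correlation ID for easy filtering
        logger.info(json.dumps({"run_id": run_id, "event": event, **fields}))

    return run_id, log

buf = io.StringIO()
run_id, log = make_run_logger(buf)
log("stage_started", stage="ingest", feed="feed_a")
log("stage_finished", stage="ingest", rows=250000)
```

Grepping (or querying) by a single run_id is what collapsed diagnosis time: every line of one run is retrievable without untangling interleaved output from other runs.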
Results
- SLA compliance improved from ~80% to 99.6% over the first quarter after
  launch
- End-to-end runtime for a full daily load fell from an average of six hours
  to under ninety minutes
- The audit trail satisfied a subsequent compliance review with no
  remediation required
- Two junior developers maintained and extended the platform without needing
  support from me after a two-week handover
What I’d Do Differently
The parallelism model worked well but was implemented with Python threads and
processes rather than a task queue. For the volume at the time this was fine,
but as data volumes grew the team eventually had to migrate to a dedicated
orchestration tool. I would have introduced something like Celery or a
lightweight workflow engine from the start to avoid that migration cost.