I recently refactored a data pipeline: 14 sequential scripts, 6 intermediate CSV files, three separate API calls to the same source.
It worked. But it was one upstream schema change away from breaking.
The actual problem wasn't the code
When I drew out the pipeline, the issue became obvious. We were pulling data from Shopify three separate times, flattening the responses in Python, writing the results to CSV files, then loading it into PostgreSQL, where we immediately start joining it back together.
We were doing in Python what the database was designed to do natively.
The 14 scripts weren't the disease. They were a symptom of one design decision made early on: using Python as the transformation layer. Over time, that decision had compounded.
The book that changed how I saw the system
I was reading Martin Kleppmann's Designing Data-Intensive Applications during this period. Two ideas landed differently once I had a real system in front of me:
The old pipeline enforced schema at write time. Python needed to know the exact structure of the JSON before it could flatten it. Any change from Shopify's API broke the script.
If instead you store the raw JSON and let the database read from it, you can handle schema evolution gracefully. New field in the API response? It's already in your database. Just update the query that reads it.
We were pulling raw JSON into Python, transforming it, then loading it into PostgreSQL. PostgreSQL is built for transforming data. The right move was to store the raw JSON and let SQL do the work.
There's a gap between understanding these ideas in the abstract and seeing exactly where they apply in a system you're responsible for. That gap is where the work is.
What the refactor looked like
- 3 separate API extractions (orders via REST, products via REST, orders via GraphQL)
- Python flattening scripts to normalize the nested JSON
- CSV as an intermediate format
- 14 scripts running sequentially
- A single GraphQL query fetching orders with all nested data in one call
- Raw JSON inserted directly into PostgreSQL as a JSONB column
- SQL transforms that extract, cast, and join data at query time
- 3 scripts total
The results
If Shopify changes their response format, the raw data is already in the database. The only adjustment is updating the SQL, not the extraction code.
The tradeoffs I accepted
Every architectural decision introduces tradeoffs. Here are the ones I consciously accepted:
Data is as fresh as the last ETL run, not live. For this use case (supplier cost dashboards), that's fine. If the business needed sub-minute freshness, I'd revisit it.
The output table has one row per line item, which means order-level fields (customer name, shipping address) repeat across rows. That's a storage tradeoff I accepted because it makes dashboard queries simpler — no complex joins at read time.
PostgreSQL's JSONB functions aren't something every engineer knows. I wrote documentation. That's the cost of using the right tool.
What I'd take to the next project
Preserve raw data. Defer transformation.
Storing the raw API response as JSONB means the database is the source of truth, not a derived artifact. You can always reprocess it. You can't go back and re-flatten a CSV that's already been overwritten.
One more lesson: draw the system before you touch it.
The 14-script pipeline was built incrementally, one reasonable addition at a time. The complexity was invisible until the entire system was mapped out on paper.