A customer placed an order in production. The confirmation page loaded.
But the cook and bartender dashboards showed nothing. The order existed in the database: payment_status: pending
That was the problem.
The dashboards filter on payment_status = 'paid'. An order that never gets confirmed, never shows up.
No alert. No error. No food. Then the customer waits.
What actually happened
Staging worked. Production didn't.
Same code. Same Docker images. Different outcome.
The difference: staging had the correct database schema, but production didn't.
The root cause
The deploy workflow had this line:
cat ~/[app]/init-scripts/schema.sql | docker exec -i [app]-postgres-production \ psql -U [app]_user -d [app] || echo 'Schema may already exist'
That last part is the lie:
|| echo 'Schema may already exist'
If the file is missing → the pipeline prints a message and continues.
If the SQL has errors → same thing.
If psql can't connect → same thing.
The build stays green but the schema is not applied. And nobody knows.
I wrote a deploy step that skipped verification and moved on. I never checked if the schema applied. The app never checked if its tables existed. The health check returned HTTP 200 because the server booted, not because it worked.
The audit
I grepped both workflow files for || true, || echo, and unguarded curl calls.
Eight hits. Two broke production.
| Risk | Pattern | Impact |
|---|---|---|
| High | schema.sql || echo (staging) | DB schema not applied, app broken |
| High | schema.sql || echo (production) | DB schema not applied, app broken |
| Medium | ssh-keygen || echo (production) | Bad key undetected, deploy fails later |
| Low | grep || true (staging) | Expected behavior, no risk |
| Low | grep || true (production x2) | Expected behavior, no risk |
| Low | curl without -f flag (staging) | Cache not purged, stale content |
| Low | curl without -f flag (production) | Cache not purged, stale content |
Five were harmless. One caused confusion. Two broke production.
The fix
Three layers. Each catches what the others miss.
Stop on first failure
Added set -euo pipefail to every SSH block:
ssh user@server " set -euo pipefail docker login ghcr.io ... docker compose pull docker compose up -d "
| Flag | Effect |
|---|---|
-e | Exit immediately if any command fails |
-u | Treat unset variables as errors |
-o pipefail | Fail if any command in the pipe fails |
Without pipefail, a pipeline like cat file | psql only checks the exit code of psql. If cat fails because the file is missing, the pipe still reports success. psql receives empty input and exits 0.
With pipefail, the pipe fails if either side fails.
This alone would have caught the schema bug.
Replace silence with validation
Replace every || echo with proper error handling.
- •
|| echo - •
|| true - • Ignored errors
- • Green pipeline
- • Validation
- • Logs
- • Warnings
- • Explicit errors
Schema validation — before:
cat schema.sql | docker exec -i postgres psql ... || echo 'Schema may already exist'
After:
if [ ! -f ~/[app]/init-scripts/schema.sql ]; then echo 'ERROR: schema.sql not found! Did SCP step fail?' exit 1 fi echo 'Applying database schema...' cat ~/[app]/init-scripts/schema.sql | docker exec -i postgres psql ... echo 'Schema applied successfully'
SSH key validation — before:
ssh-keygen -l -f ~/.ssh/deploy_key || echo "Key validation failed"
After:
if ! ssh-keygen -l -f ~/.ssh/deploy_key; then echo "ERROR: SSH key validation failed!" echo "Check PRODUCTION_SSH_KEY secret format" exit 1 fi echo "SSH key validated successfully"
Validate before deploying
Added a validation step before any file gets to the server:
- name: Validate required files exist run: | echo "Validating deployment files..." if [ ! -f infrastructure/docker/docker-compose.production.yml ]; then echo "ERROR: docker-compose.production.yml not found" exit 1 fi if [ ! -d infrastructure/docker/init-scripts ]; then echo "ERROR: init-scripts directory not found" exit 1 fi if [ ! -f infrastructure/docker/init-scripts/schema.sql ]; then echo "ERROR: schema.sql not found" exit 1 fi echo "All required files present"
Catches missing files on the build runner, where the error message is obvious.
Tradeoffs
Not every || true is wrong. grep exits 1 when it finds zero matches. That's correct behavior, not an error. If your .env file has no STRIPE_ lines, grep -v '^STRIPE_' .env exits 1 and set -e kills the build.
The fix is a comment, not removal:
# || true is intentional: grep exits 1 when no matches found; # this is expected if .env has no STRIPE_ lines grep -v '^STRIPE_' .env > .env.tmp || true
The Cloudflare cache purge was a different case. After making it fail loudly, it blocked every deploy. The API token didn't have the right permissions.
A cache issue was killing deploys that had nothing to do with the application.
So I switched to continue-on-error: true with a warning:
- name: Purge Cloudflare Cache continue-on-error: true # TODO: fix API token permissions run: | if ! curl -sf ... purge_cache; then echo "WARNING: Cloudflare cache purge failed (non-blocking)" else echo "Cloudflare cache purged successfully" fi
There's a spectrum:
+ WARNING log
+ TODO comment
Silence looks like success.
Failure looks like a problem.
The right answer tells the truth.
Takeaway
Add set -euo pipefail to every shell block. Audit every || true and || echo.
For each one, ask: if this fails, should the build continue?
If yes, document it.
If no, let it fail.
Validate files before deploying them. If you allow a failure, log a warning. If a deploy step doesn't verify its own success, you're guessing it worked. Five commits fixed that.