Stop Retrying Rate-Limited APIs Inside Your Request Handler

A practical system design article on why retrying third-party API calls inside request handlers breaks under load, and how to replace it with a durable, observable worker pipeline.

Tags: system-design, queues, retries, idempotency, serverless, observability

The Mistake

A very normal architecture mistake is doing third-party side effects directly inside your API handler, then stapling retries onto it when production gets messy.

You see this in billing syncs, CRM writes, email sends, webhook fan-out, Slack notifications, and "just one quick call" to some vendor API with a cute 429 policy and a very real outage budget.

The flow usually starts like this:

  • User hits POST /accounts/:id/sync

  • API reads the DB

  • API calls three external services

  • If one fails, the handler retries a few times

  • If the platform times out, the client retries too

  • Eventually somebody adds a flag called ENABLE_AGGRESSIVE_RETRY

That design works in staging because staging has no contention, no burst traffic, and no real rate limits. In production, it turns into a failure amplifier.

AWS's Builders Library makes the core point better than most architecture talks: retries are selfish. They spend more downstream capacity to improve your own success rate. If the dependency is already overloaded, retries make recovery slower, not faster.

Why It Breaks

The problem is not only "timeouts happen." The problem is that you combined too many responsibilities at the wrong boundary.

Your request handler is now doing all of this at once:

  • Accepting user intent

  • Performing business validation

  • Owning external retries

  • Absorbing vendor rate limits

  • Managing partial failure

  • Deciding what the user should see when work is still in flight

That boundary is too fat.

Failure Mode 1: Retries Multiply Across Layers

If your API retries, your SDK retries, your queue consumer retries, and the client retries, you do not have resilience. You have retry fan-out.

AWS gives the classic example: retries at multiple layers multiply load. In a five-deep call stack where each layer tries up to three times, one user request can turn into 3^5 = 243 calls against the bottom dependency during a failure. That is how a partial outage becomes a full one.
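The 243x figure is plain exponentiation, and it is worth seeing how fast it grows. The helper below is a back-of-the-envelope sketch; `load_multiplier` is an illustrative name, and it assumes the worst case where every layer exhausts all of its attempts.

```python
def load_multiplier(depth: int, attempts_per_layer: int) -> int:
    """Worst-case calls reaching the bottom dependency for one user request.

    Each layer makes up to `attempts_per_layer` calls to the layer below,
    so the totals multiply rather than add.
    """
    return attempts_per_layer ** depth


print(load_multiplier(depth=5, attempts_per_layer=3))  # 243
print(load_multiplier(depth=5, attempts_per_layer=1))  # 1
```

If only one of the five layers retries and the rest make a single attempt, the worst case drops from 243 calls to 3. That is the argument for retrying in exactly one place.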

Failure Mode 2: Duplicate Side Effects

Most real systems are at-least-once somewhere.

Amazon SQS says it plainly for standard queues: design consumers to be idempotent because messages can be delivered more than once. Cloudflare Queues documents the same at-least-once behavior and also explains the tradeoff: stronger delivery guarantees cost latency and throughput.

So if your handler says "timeout means try again," but your first call might already have succeeded, congratulations: you built a duplicate sender.

This is why Stripe's idempotency model is worth copying conceptually even when you are not building payments. The client supplies a unique key, the server records it atomically with the mutation, and later retries return the same effective result instead of doing the work twice.
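Here is a minimal sketch of that idea using SQLite from the Python standard library. It is not Stripe's implementation. The table names, `send_once`, and the `emails_sent` table (which stands in for whatever mutation you are protecting) are assumptions made for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key on `key` is what rejects a second insert of the same key.
    conn.execute(
        "CREATE TABLE idempotency (key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE emails_sent (recipient TEXT NOT NULL)")
    return conn


def send_once(conn: sqlite3.Connection, key: str, recipient: str) -> str:
    """Apply the mutation at most once per idempotency key."""
    with conn:  # one transaction: commits on success, rolls back on exception
        row = conn.execute(
            "SELECT result FROM idempotency WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # replay: return the stored result, do no new work
        conn.execute("INSERT INTO emails_sent (recipient) VALUES (?)", (recipient,))
        result = f"sent:{recipient}"
        conn.execute(
            "INSERT INTO idempotency (key, result) VALUES (?, ?)", (key, result)
        )
        return result
```

Calling `send_once` twice with the same key returns the same result and leaves one row in `emails_sent`. Note the limit of the sketch: a database transaction can only cover database writes, so a real outbound call still needs the vendor to honor an idempotency key or some other deduplication on its side.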

Failure Mode 3: Rate Limits Become User-Facing Latency

Rate limits are not really request concerns. They are scheduling concerns.

Google Cloud Tasks exposes this directly in its queue configuration, with a maximum dispatch rate and a maximum number of concurrent dispatches. That is the right mental model. If a dependency allows 20 writes per second, your system should shape work to 20 writes per second. It should not discover the limit one request at a time from angry 429s.
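A token bucket is one common way to shape work to a fixed rate. The sketch below is illustrative and is not how Cloud Tasks works internally; `TokenBucket` and `try_acquire` are made-up names, and the clock is passed in explicitly so the behavior is easy to reason about and test.

```python
class TokenBucket:
    """Allow up to `rate_per_sec` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should leave the job queued, not hit the vendor


bucket = TokenBucket(rate_per_sec=20, capacity=20)
granted = sum(bucket.try_acquire(now=0.0) for _ in range(50))
print(granted)  # 20
```

A worker that gets `False` back leaves the job on the queue and tries again later. The vendor never sees the other 30 requests, which is the whole point.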

The Better Architecture

Split command acceptance from side-effect execution.

  • The API accepts intent quickly

  • The DB records the state change and an outbox event in one transaction

  • A worker or durable workflow publishes and executes side effects

  • Retries happen in one place

  • Rate limiting happens in one place

  • Status is observable and queryable

This is the boring architecture that keeps working.

Step 1: Make the API Boundary Small

Your API should answer: "Did I accept this command?" Not: "Did every external system finish all side effects before this HTTP timeout?"

Return something like:

{
  "jobId": "job_123",
  "status": "accepted"
}

If the caller needs synchronous UX, fine. Poll status, stream status, or send a webhook later. But stop pretending a third-party write is part of the same reliability domain as your request/response cycle.
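In handler terms, that boundary can be as small as the sketch below. It is framework-free on purpose. `accept_sync`, `get_job`, and the in-memory `JOBS` dict are hypothetical stand-ins for your route handlers and your database.

```python
import uuid

# Stand-in for a jobs table; a real system would persist this durably.
JOBS = {}


def accept_sync(account_id: str):
    """POST /accounts/:id/sync: record intent and return immediately."""
    job_id = f"job_{uuid.uuid4().hex[:12]}"
    JOBS[job_id] = {"accountId": account_id, "status": "accepted"}
    # 202 Accepted: the command was recorded, the side effects have not run yet.
    return 202, {"jobId": job_id, "status": "accepted"}


def get_job(job_id: str):
    """Status lookup: lets the caller poll instead of holding a request open."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, {"jobId": job_id, "status": job["status"]}
```

Nothing in `accept_sync` talks to a vendor, so its latency and failure modes are entirely your own.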

Step 2: Use an Outbox, Not a Dual Write

The easy bug is:

  • Save business row to DB

  • Publish event to queue

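If the DB commit succeeds and the publish fails, your state changed but the event vanished. Reversing the order does not help: publish first, and a failed commit leaves you with an event for a change that never happened.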

The transactional outbox pattern fixes that. Save the business mutation and the outbox record in the same transaction. A worker relays outbox entries later. Now your data flow is durable even when the broker or queue is briefly unavailable.
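A minimal sketch of the pattern with SQLite follows. The schema, `request_sync`, `relay_outbox`, and the `publish` callback are all illustrative assumptions; the part that matters is that the business write and the outbox write share one transaction.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, sync_requested INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def request_sync(conn: sqlite3.Connection, account_id: str) -> None:
    """Write the state change and the outbox record atomically."""
    with conn:  # both inserts commit together or not at all
        conn.execute(
            "INSERT OR REPLACE INTO accounts (id, sync_requested) VALUES (?, 1)",
            (account_id,),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("account.sync", json.dumps({"accountId": account_id})),
        )


def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

The relay is at-least-once: if the process dies after `publish` returns but before the row is marked, the event goes out again on the next pass. That is acceptable only because consumers are idempotent.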

That is the difference between "event-driven" and "eventually inconsistent in a surprising way."

Step 3: Centralize Retries and Rate Limits in Workers

Workers should own:

  • Exponential backoff

  • Jitter

  • Max attempts

  • Dead-letter handling

  • Per-tenant concurrency caps

  • Vendor-specific throttles
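The first four items on that list fit in a few lines. This is a sketch under assumptions, not a production retry library: `RetryableError`, `call_vendor`, and the `dead_letters` list are hypothetical names, and the delay uses the "full jitter" variant, which picks a random wait between zero and the exponential ceiling.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (429s, timeouts)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def process(job, call_vendor, dead_letters, max_attempts: int = 5, sleep=time.sleep) -> bool:
    """Try the side effect up to max_attempts times, then dead-letter the job."""
    for attempt in range(max_attempts):
        try:
            call_vendor(job)
            return True
        except RetryableError:
            # Only wait if another attempt is coming.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(job)  # out of attempts: park it for inspection or replay
    return False
```

Errors that are not `RetryableError` propagate on purpose. A 400 from the vendor will not get better on the fifth try.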

This is where you implement fairness too. One noisy tenant should not consume the global retry budget and starve everyone else. Partition queues by tenant tier, shard by account, or keep per-tenant token buckets. The exact mechanism matters less than the principle: multi-tenant systems need isolation, not vibes.
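One simple way to get that isolation is round-robin dispatch over per-tenant queues. The sketch below is one option among the several named above, and `FairScheduler` is an illustrative name rather than a library class.

```python
from collections import deque


class FairScheduler:
    """Round-robin over per-tenant queues so one backlog cannot starve the rest."""

    def __init__(self):
        self.queues = {}      # tenant id -> deque of pending jobs
        self.order = deque()  # rotation order of known tenants

    def enqueue(self, tenant: str, job) -> None:
        if tenant not in self.queues:
            self.queues[tenant] = deque()
            self.order.append(tenant)
        self.queues[tenant].append(job)

    def next_job(self):
        """Return (tenant, job) from the next tenant with work, or None if idle."""
        for _ in range(len(self.order)):
            tenant = self.order[0]
            self.order.rotate(-1)  # move this tenant to the back of the rotation
            queue = self.queues[tenant]
            if queue:
                return tenant, queue.popleft()
        return None
```

If a noisy tenant enqueues three jobs and a quiet tenant enqueues one afterwards, the quiet tenant's job is dispatched second, not fourth.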

Step 4: Prefer Durable Execution for Long or Fragile Flows

For multi-step jobs, plain queues are often too low-level. You end up rebuilding state machines in application code.

Tools like Cloudflare Workflows exist for a reason: durable steps, automatic retries, persisted state, waiting for external events, and built-in debugging. The point is not vendor fandom. The point is that serverless request lifetimes are a terrible place to hide long-running business processes.
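The core idea behind durable execution is small enough to sketch: checkpoint each step's result, and skip completed steps when the flow resumes. The code below is a toy illustration of that idea only. It does not show how Cloudflare Workflows or any other product is implemented, and `run_workflow` and the `checkpoints` dict (standing in for persisted state) are assumptions.

```python
def run_workflow(checkpoints: dict, steps) -> dict:
    """Run named steps in order, skipping any whose result is already stored.

    `steps` is a list of (name, function) pairs. Each function receives the
    checkpoints so far and returns a result to store under its name.
    """
    for name, step in steps:
        if name in checkpoints:
            continue  # finished in an earlier run; do not repeat the side effect
        # If the step raises, nothing is stored and a later run retries it.
        checkpoints[name] = step(checkpoints)
    return checkpoints
```

If step two of three times out, a second call to `run_workflow` with the same `checkpoints` resumes at step two. Step one does not run again, which matters a great deal when step one was "charge the customer."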

What to Observe

If you cannot answer "where is job job_123 stuck and why?" your architecture is not done.

Use traces, metrics, and logs with shared identifiers. OpenTelemetry is the obvious baseline.

Track at least:

  • Queue depth

  • Oldest message age

  • Retry count by dependency

  • Success rate after retry

  • Dead-letter volume

  • Per-tenant backlog

  • Time from accepted to completed

  • Idempotency-key collisions or replays

The most useful dashboard in these systems is not request latency. It is work latency.
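Two of those numbers, oldest message age and time from accepted to completed, fall straight out of job timestamps. The sketch below assumes a hypothetical `Job` record with `accepted_at` and `completed_at` fields in seconds, and uses a nearest-rank percentile. In practice you would emit these as metrics rather than compute them in application code.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: str
    accepted_at: float
    completed_at: Optional[float] = None  # None while the job is in flight


def work_latencies(jobs: List[Job]) -> List[float]:
    """Sorted accepted-to-completed durations for finished jobs."""
    return sorted(
        j.completed_at - j.accepted_at for j in jobs if j.completed_at is not None
    )


def oldest_pending_age(jobs: List[Job], now: float) -> float:
    """Age of the oldest unfinished job; 0.0 when nothing is pending."""
    return max((now - j.accepted_at for j in jobs if j.completed_at is None), default=0.0)


def percentile(sorted_values: List[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]
```

A rising `oldest_pending_age` with flat request latency is the signature of a stuck worker pipeline. Request dashboards will look healthy the entire time.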

Tradeoffs

This architecture is better, not free.

  • You add operational components: workers, queues, schedulers, dead-letter drains.

  • You accept eventual consistency. The user may see "accepted" before side effects finish.

  • You need product language for in-progress states.

  • Debugging spans more components unless you instrument it properly.

  • Outbox cleanup and replay tooling become part of your platform surface area.

Still, these are good tradeoffs. You are moving complexity from the least reliable place, the request path, into explicit infrastructure that can be throttled, observed, replayed, and audited.

The Opinionated Rule

If a write depends on a rate-limited or failure-prone external system, do not make your API handler personally responsible for getting it all the way done.

Accept intent. Persist it durably. Execute side effects in workers. Make side effects idempotent so retries are safe. Shape traffic before the vendor shapes it for you.

That architecture is less exciting than "edge-native real-time AI event mesh" or whatever this week's naming crime is. It is also the one that still works when the dependency is slow, your tenants are bursty, and somebody hits refresh five times.