Building Smart Retry Strategies in Rails with Error-Aware Delays

A recent Rails change lets your job retry logic inspect the actual error that occurred. This opens up retry strategies that were previously awkward to implement.

The Old Way

Before this change, retry_on wait procs only received the execution count:

class ApiJob < ApplicationJob
  retry_on ApiError, wait: ->(executions) { executions ** 2 }

  def perform(endpoint)
    ExternalApi.call(endpoint)
  end
end

This works for basic exponential backoff, but what if the API tells you exactly when to retry? Rate-limited APIs often include a Retry-After header. With only the execution count, you can’t access that information.
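
One workaround was to bypass retry_on entirely and reimplement the scheduling by hand with retry_job. A rough sketch of that approach, assuming the error exposes a retry_after hint like the error classes defined later in this post:

class ApiJob < ApplicationJob
  MAX_ATTEMPTS = 5

  def perform(endpoint)
    ExternalApi.call(endpoint)
  rescue ApiError => e
    # Re-raise once we've used up our attempts, mirroring what attempts: would do for us
    raise if executions >= MAX_ATTEMPTS

    # Hand-rolled wait logic, because the retry_on proc can't see the error
    retry_job(wait: e.retry_after || executions ** 2)
  end
end

It works, but you lose the declarative retry_on configuration and have to manage the attempt counting yourself.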

The New Way

PR #56601, just merged into Rails main, adds the error as an optional second argument. This will ship with Rails 8.2.

class ApiJob < ApplicationJob
  retry_on ApiError, wait: ->(executions, error) { error.retry_after || executions ** 2 }

  def perform(endpoint)
    ExternalApi.call(endpoint)
  end
end

Now you can inspect the error and decide how long to wait based on what actually went wrong. The change is backward compatible: procs with arity 1 continue to receive only the execution count.

Patterns

Here are a few ways to use this in practice.

Pattern 1: Respect Rate Limits

When an API rate-limits you, it often tells you when to retry:

class RateLimitError < StandardError
  attr_reader :retry_after

  def initialize(message, retry_after: nil)
    super(message)
    @retry_after = retry_after
  end
end

class SyncToStripeJob < ApplicationJob
  retry_on RateLimitError,
    wait: ->(executions, error) {
      # Trust the API's guidance, with a sensible fallback
      error.retry_after || (executions * 30.seconds)
    },
    attempts: 10

  def perform(user)
    Stripe::Customer.update(user.stripe_id, user.stripe_attributes)
  rescue Stripe::RateLimitError => e
    raise RateLimitError.new(e.message, retry_after: e.http_headers["retry-after"]&.to_i)
  end
end

This respects the API’s backpressure signals instead of blindly hammering it.
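
One refinement worth considering, not part of the job above: clamp whatever the header says, so a bogus Retry-After value can't park the job for an hour. A sketch:

class SyncToStripeJob < ApplicationJob
  MAX_WAIT = 10.minutes

  retry_on RateLimitError,
    wait: ->(executions, error) {
      suggested = error.retry_after&.seconds || (executions * 30.seconds)
      # Trust the header, but never wait longer than MAX_WAIT
      [suggested, MAX_WAIT].min
    },
    attempts: 10

  # perform is unchanged from the version above
end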

Pattern 2: Extracting Retry Hints from Exception Messages

Some exceptions encode useful information in their message. For example, a lock timeout might tell you how long the transaction waited before giving up:

class LockTimeoutError < StandardError
  attr_reader :lock_wait_time

  def initialize(message, lock_wait_time: nil)
    super(message)
    @lock_wait_time = lock_wait_time
  end
end

class ImportJob < ApplicationJob
  retry_on LockTimeoutError,
    wait: ->(executions, error) {
      # If we know how long we waited for the lock, wait at least that long
      # before retrying, plus some jitter
      base_delay = error.lock_wait_time || executions ** 2
      jitter = rand(0.0..1.0) * base_delay
      base_delay + jitter
    },
    attempts: 5

  def perform(batch)
    Record.transaction do
      batch.each { |row| Record.upsert(row) }
    end
  rescue ActiveRecord::LockWaitTimeout => e
    # Extract wait time if your database adapter provides it
    raise LockTimeoutError.new(e.message, lock_wait_time: extract_wait_time(e))
  end

  private

  def extract_wait_time(error)
    # Parse from error message or metadata if available
    error.message[/waited (\d+)s/, 1]&.to_i
  end
end

The retry delay now adapts to the actual contention observed.
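
As a quick check of the parsing, here is what extract_wait_time would produce for a hypothetical adapter message. The exact wording varies by database and adapter, so verify against what yours actually raises:

message = "Lock wait timeout exceeded; waited 12s for row lock"  # hypothetical wording
message[/waited (\d+)s/, 1]&.to_i  # => 12

message = "Lock wait timeout exceeded"                           # no hint in the message
message[/waited (\d+)s/, 1]&.to_i  # => nil, so the proc falls back to executions ** 2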

Pattern 3: Context-Aware Delays Based on Error Details

Some errors carry context that should influence retry timing:

class WebhookDeliveryError < StandardError
  attr_reader :status_code, :response_body

  def initialize(message, status_code:, response_body: nil)
    super(message)
    @status_code = status_code
    @response_body = response_body
  end

  def transient?
    status_code.in?(500..599) || status_code == 429
  end

  def suggested_delay
    case status_code
    when 429 then 60.seconds  # Rate limited, back off significantly
    when 503 then 30.seconds  # Service unavailable, moderate backoff
    when 500..502, 504..599 then 10.seconds  # Server errors, shorter delay
    else 5.seconds
    end
  end
end

class DeliverWebhookJob < ApplicationJob
  retry_on WebhookDeliveryError,
    wait: ->(executions, error) {
      error.suggested_delay * executions
    },
    attempts: 8

  def perform(webhook)
    response = HTTP.post(webhook.url, json: webhook.payload)

    unless response.status.success?
      raise WebhookDeliveryError.new(
        "Webhook delivery failed",
        status_code: response.status.code,  # the integer code, not the Status object
        response_body: response.body.to_s
      )
    end
  end
end

This treats a 503 differently from a 500, and both differently from a 429.
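
The transient? predicate isn't used by the wait proc above, but it pairs well with a non-retriable error class so permanent 4xx failures skip the retry loop entirely. A sketch, with PermanentWebhookError as an illustrative name:

class PermanentWebhookError < StandardError; end

class DeliverWebhookJob < ApplicationJob
  discard_on PermanentWebhookError

  retry_on WebhookDeliveryError,
    wait: ->(executions, error) { error.suggested_delay * executions },
    attempts: 8

  def perform(webhook)
    response = HTTP.post(webhook.url, json: webhook.payload)
    return if response.status.success?

    error = WebhookDeliveryError.new(
      "Webhook delivery failed",
      status_code: response.status.code,
      response_body: response.body.to_s
    )

    raise error if error.transient?

    # Anything else (most 4xx responses) won't get better on retry
    raise PermanentWebhookError, error.message
  end
end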

Pattern 4: Multi-Error Strategy with Shared Logic

For jobs that can fail in multiple ways, centralize your retry logic:

module RetryStrategies
  STRATEGIES = {
    rate_limit: ->(executions, error) {
      # Fall back to a flat minute when the error has no retry_after hint
      (error.retry_after if error.respond_to?(:retry_after)) || 60.seconds
    },
    transient: ->(executions, error) {
      (2 ** executions) + rand(0..executions)
    },
    network: ->(executions, error) {
      [5.seconds * executions, 2.minutes].min
    }
  }

  def self.for(type)
    STRATEGIES.fetch(type)
  end
end

class ExternalSyncJob < ApplicationJob
  retry_on RateLimitError, wait: RetryStrategies.for(:rate_limit), attempts: 10
  retry_on Net::OpenTimeout, wait: RetryStrategies.for(:network), attempts: 5
  retry_on Faraday::ServerError, wait: RetryStrategies.for(:transient), attempts: 5

  def perform(record)
    ExternalService.sync(record)
  end
end

This keeps retry policies consistent across your application.
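
If several jobs share the same failure modes, the declarations themselves can live in a concern as well. A sketch, with ExternalCallRetries as an illustrative name:

module ExternalCallRetries
  extend ActiveSupport::Concern

  included do
    retry_on RateLimitError, wait: RetryStrategies.for(:rate_limit), attempts: 10
    retry_on Net::OpenTimeout, wait: RetryStrategies.for(:network), attempts: 5
    retry_on Faraday::ServerError, wait: RetryStrategies.for(:transient), attempts: 5
  end
end

class ExternalSyncJob < ApplicationJob
  include ExternalCallRetries

  def perform(record)
    ExternalService.sync(record)
  end
end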

Error Classes That Carry Context

To get the most out of this, wrap external errors with useful context:

class ExternalApiError < StandardError
  attr_reader :original_error, :retry_after, :retriable

  def initialize(message, original_error: nil, retry_after: nil, retriable: true)
    super(message)
    @original_error = original_error
    @retry_after = retry_after
    @retriable = retriable
  end

  def self.from_response(response)
    new(
      "API returned #{response.status}",
      retry_after: parse_retry_after(response),
      retriable: response.status.in?(500..599) || response.status == 429
    )
  end

  private_class_method def self.parse_retry_after(response)
    value = response.headers["Retry-After"]
    return nil unless value

    if value.match?(/^\d+$/)
      value.to_i.seconds
    else
      Time.httpdate(value) - Time.current rescue nil
    end
  end
end

Then your job can branch on those details:

class ApiSyncJob < ApplicationJob
  retry_on ExternalApiError,
    wait: ->(executions, error) {
      error.retry_after || (executions ** 2).seconds
    },
    attempts: 10

  def perform(resource)
    response = ApiClient.sync(resource)
    raise ExternalApiError.from_response(response) unless response.success?
  end
end

Combining with discard_on

Not every error should be retried. Use discard_on for errors that will never succeed:

class ProcessPaymentJob < ApplicationJob
  discard_on PaymentDeclinedError  # Don't retry declined cards

  retry_on PaymentGatewayError,
    wait: ->(executions, error) {
      error.retry_after || (10.seconds * executions)
    },
    attempts: 5

  def perform(order)
    PaymentGateway.charge(order)
  end
end
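
The split only works if the gateway wrapper raises the right class for each failure. A minimal sketch, assuming PaymentGatewayError carries a retry_after hint like the error classes above, and that the underlying client (GatewayClient here is hypothetical) exposes the HTTP status and headers:

class PaymentGateway
  def self.charge(order)
    response = GatewayClient.charge(order.payment_token, order.total_cents)

    case response.status
    when 402
      # Declined cards are permanent failures; discard_on drops the job
      raise PaymentDeclinedError, "card declined"
    when 429, 500..599
      raise PaymentGatewayError.new(
        "gateway returned #{response.status}",
        retry_after: response.headers["Retry-After"]&.to_i
      )
    end

    response
  end
end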

Instead of treating all errors the same, you can now build retry strategies that adapt to the specific failure. Your jobs can back off when APIs tell them to, jitter their retries to avoid collisions, and fail fast when retrying won’t help.

This feature will be available in Rails 8.2.