Building Smart Retry Strategies in Rails with Error-Aware Delays
A recent Rails change lets your job retry logic inspect the actual error that occurred. This opens up retry strategies that were previously awkward to implement.
The Old Way
Before this change, retry_on wait procs only received the execution count:
class ApiJob < ApplicationJob
  retry_on ApiError, wait: ->(executions) { executions ** 2 }

  def perform(endpoint)
    ExternalApi.call(endpoint)
  end
end
This works for a simple quadratic backoff, but what if the API tells you exactly when to retry? Rate-limited APIs often include a Retry-After header. With only the execution count, you can't access that information.
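For reference, the HTTP spec allows Retry-After to carry either a delay in seconds or an absolute HTTP date, so a well-behaved client needs to handle both forms:

Retry-After: 120
Retry-After: Wed, 21 Oct 2015 07:28:00 GMT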
The New Way
PR #56601, just merged into Rails main, adds the error as an optional second argument. This will ship with Rails 8.2.
class ApiJob < ApplicationJob
  retry_on ApiError, wait: ->(executions, error) { error.retry_after || executions ** 2 }

  def perform(endpoint)
    ExternalApi.call(endpoint)
  end
end
Now you can inspect the error and make smart decisions. The change is backwards compatible—procs with arity 1 continue receiving only the execution count.
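To make the compatibility concrete, here is a minimal sketch (assuming, as in the example above, that ApiError exposes a retry_after reader). Both declarations can coexist in the same codebase:

class LegacyJob < ApplicationJob
  # Arity-1 proc: still called with only the execution count
  retry_on ApiError, wait: ->(executions) { executions ** 2 }
end

class ModernJob < ApplicationJob
  # Arity-2 proc: also receives the raised error
  retry_on ApiError, wait: ->(executions, error) { error.retry_after || executions ** 2 }
end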
Patterns
Here are a few ways to use this in practice.
Pattern 1: Respect Rate Limits
When an API rate-limits you, it often tells you when to retry:
class RateLimitError < StandardError
  attr_reader :retry_after

  def initialize(message, retry_after: nil)
    super(message)
    @retry_after = retry_after
  end
end
class SyncToStripeJob < ApplicationJob
  retry_on RateLimitError,
    wait: ->(executions, error) {
      # Trust the API's guidance, with a sensible fallback
      error.retry_after || (executions * 30.seconds)
    },
    attempts: 10

  def perform(user)
    Stripe::Customer.update(user.stripe_id, user.stripe_attributes)
  rescue Stripe::RateLimitError => e
    raise RateLimitError.new(e.message, retry_after: e.http_headers["retry-after"]&.to_i)
  end
end
This respects the API’s backpressure signals instead of blindly hammering it.
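Because the wait option is just a proc, you can sanity-check the delay logic in isolation before wiring it into a job. A quick console sketch using the RateLimitError class defined above:

wait = ->(executions, error) { error.retry_after || (executions * 30.seconds) }

wait.call(3, RateLimitError.new("slow down", retry_after: 12)) # => 12 (the API's hint wins)
wait.call(3, RateLimitError.new("slow down"))                  # => 90 seconds (fallback backoff)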
Pattern 2: Extracting Retry Hints from Exception Messages
Some exceptions encode useful information in their message. For example, a lock timeout might tell you how long the session waited before giving up:
class LockTimeoutError < StandardError
  attr_reader :lock_wait_time

  def initialize(message, lock_wait_time: nil)
    super(message)
    @lock_wait_time = lock_wait_time
  end
end
class ImportJob < ApplicationJob
  retry_on LockTimeoutError,
    wait: ->(executions, error) {
      # If we know how long we waited for the lock, wait at least that long
      # before retrying, plus some jitter
      base_delay = error.lock_wait_time || executions ** 2
      jitter = rand(0.0..1.0) * base_delay
      base_delay + jitter
    },
    attempts: 5

  def perform(batch)
    Record.transaction do
      batch.each { |row| Record.upsert(row) }
    end
  rescue ActiveRecord::LockWaitTimeout => e
    # Extract wait time if your database adapter provides it
    raise LockTimeoutError.new(e.message, lock_wait_time: extract_wait_time(e))
  end

  private

  def extract_wait_time(error)
    # Parse from the error message or metadata if available
    error.message[/waited (\d+)s/, 1]&.to_i
  end
end
The retry delay now adapts to the actual contention observed.
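To make the jitter arithmetic concrete: if the database reports a 4-second lock wait, the computed delay lands somewhere between 4 and 8 seconds, so concurrent retries spread out instead of colliding:

base_delay = 4                        # seconds spent waiting for the lock
jitter = rand(0.0..1.0) * base_delay  # anywhere from 0.0 to 4.0
base_delay + jitter                   # final delay: 4.0..8.0 seconds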
Pattern 3: Context-Aware Delays Based on Error Details
Some errors carry context that should influence retry timing:
class WebhookDeliveryError < StandardError
  attr_reader :status_code, :response_body

  def initialize(message, status_code:, response_body: nil)
    super(message)
    @status_code = status_code
    @response_body = response_body
  end

  def transient?
    status_code.in?(500..599) || status_code == 429
  end

  def suggested_delay
    case status_code
    when 429 then 60.seconds                # Rate limited, back off significantly
    when 503 then 30.seconds                # Service unavailable, moderate backoff
    when 500..502, 504..599 then 10.seconds # Other server errors, shorter delay
    else 5.seconds
    end
  end
end
class DeliverWebhookJob < ApplicationJob
  retry_on WebhookDeliveryError,
    wait: ->(executions, error) {
      error.suggested_delay * executions
    },
    attempts: 8

  def perform(webhook)
    response = HTTP.post(webhook.url, json: webhook.payload)

    unless response.status.success?
      raise WebhookDeliveryError.new(
        "Webhook delivery failed",
        status_code: response.status.code, # the integer code, not the status object
        response_body: response.body.to_s
      )
    end
  end
end
This treats a 503 differently from a 500, and both differently from a 429.
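Note that the transient? flag above isn't consulted by retry_on automatically. One way to honor it, sketched here with a hypothetical PermanentWebhookError class, is to raise a non-retried error for permanent failures (the discard_on pattern is covered in more detail below):

class PermanentWebhookError < StandardError; end

class DeliverWebhookJob < ApplicationJob
  discard_on PermanentWebhookError
  retry_on WebhookDeliveryError,
    wait: ->(executions, error) { error.suggested_delay * executions },
    attempts: 8

  def perform(webhook)
    response = HTTP.post(webhook.url, json: webhook.payload)
    return if response.status.success?

    error = WebhookDeliveryError.new(
      "Webhook delivery failed",
      status_code: response.status.code,
      response_body: response.body.to_s
    )
    # Retry only if the failure looks transient; otherwise discard
    raise(error.transient? ? error : PermanentWebhookError.new(error.message))
  end
end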
Pattern 4: Multi-Error Strategy with Shared Logic
For jobs that can fail in multiple ways, centralize your retry logic:
module RetryStrategies
  STRATEGIES = {
    rate_limit: ->(executions, error) {
      # Fall back to a flat minute if the error carries no hint
      error.try(:retry_after) || 60.seconds
    },
    transient: ->(executions, error) {
      (2 ** executions) + rand(0..executions)
    },
    network: ->(executions, error) {
      [5.seconds * executions, 2.minutes].min
    }
  }.freeze

  def self.for(type)
    STRATEGIES.fetch(type)
  end
end
class ExternalSyncJob < ApplicationJob
  retry_on RateLimitError, wait: RetryStrategies.for(:rate_limit), attempts: 10
  retry_on Net::OpenTimeout, wait: RetryStrategies.for(:network), attempts: 5
  retry_on Faraday::ServerError, wait: RetryStrategies.for(:transient), attempts: 5

  def perform(record)
    ExternalService.sync(record)
  end
end
This keeps retry policies consistent across your application.
Error Classes That Carry Context
To get the most out of this, wrap external errors with useful context:
class ExternalApiError < StandardError
  attr_reader :original_error, :retry_after, :retriable

  def initialize(message, original_error: nil, retry_after: nil, retriable: true)
    super(message)
    @original_error = original_error
    @retry_after = retry_after
    @retriable = retriable
  end

  def self.from_response(response)
    new(
      "API returned #{response.status}",
      retry_after: parse_retry_after(response),
      retriable: response.status.in?(500..599) || response.status == 429
    )
  end

  private_class_method def self.parse_retry_after(response)
    value = response.headers["Retry-After"]
    return nil unless value

    if value.match?(/\A\d+\z/)
      value.to_i.seconds
    else
      # HTTP-date form: compute the seconds remaining until that time
      Time.httpdate(value) - Time.current rescue nil
    end
  end
end
Then your job can branch on those details:
class ApiSyncJob < ApplicationJob
  retry_on ExternalApiError,
    wait: ->(executions, error) {
      error.retry_after || (executions ** 2).seconds
    },
    attempts: 10

  def perform(resource)
    response = ApiClient.sync(resource)
    raise ExternalApiError.from_response(response) unless response.success?
  end
end
Combining with discard_on
Not every error should be retried. Use discard_on for errors that will never succeed:
class ProcessPaymentJob < ApplicationJob
  discard_on PaymentDeclinedError # Don't retry declined cards

  retry_on PaymentGatewayError,
    wait: ->(executions, error) {
      error.retry_after || (10.seconds * executions)
    },
    attempts: 5

  def perform(order)
    PaymentGateway.charge(order)
  end
end
Instead of treating all errors the same, you can now build retry strategies that adapt to the specific failure. Your jobs can back off when APIs tell them to, jitter their retries to avoid collisions, and fail fast when retrying won’t help.
This feature will be available in Rails 8.2.