Build a File Deduplication System with ActiveStorage
Every time a user uploads their company logo, profile picture, or that same PDF they’ve uploaded three times before, you’re paying to store it again. ActiveStorage doesn’t deduplicate by default—each upload creates a new blob, even if the content is identical.
Let’s fix that. We’ll build a deduplication system that detects identical files and reuses existing blobs, saving storage costs and making uploads instant for duplicates. We’ll scope deduplication to each user—so users only dedupe against their own uploads, keeping files secure.
How ActiveStorage Checksums Work
Every ActiveStorage blob has a checksum column, which holds a base64-encoded MD5 digest of the file content. Two files with identical content always produce the same checksum:
ActiveStorage::Blob.pluck(:checksum).tally
# => {"vckNNU4TN7zbzl+o3tjXPQ==" => 47, "x8K9f2mVhLpWQ..." => 12, ...}
If you see counts greater than 1, you have duplicates. Let’s eliminate them.
Adding an Index for Performance
ActiveStorage doesn’t index the checksum column by default. Without an index, deduplication queries will do full table scans—fine for small apps, but slow at scale.
Note: This migration modifies ActiveStorage’s schema. While adding an index is low-risk (no structural changes), be aware you’re customizing an engine-owned table. Document this decision for your team.
class AddIndexToActiveStorageBlobsChecksum < ActiveRecord::Migration[8.0]
  disable_ddl_transaction!

  def change
    add_index :active_storage_blobs, :checksum, algorithm: :concurrently
  end
end
The algorithm: :concurrently option (PostgreSQL) builds the index without locking writes, which is essential for production databases with existing data. It requires disable_ddl_transaction! since concurrent index creation can’t run inside a transaction.
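Once the index is live, you can confirm lookups hit it from the console; the plan output below is abbreviated and will vary by adapter:

ActiveStorage::Blob.where(checksum: "vckNNU4TN7zbzl+o3tjXPQ==").explain
# => EXPLAIN ... Index Scan using index_active_storage_blobs_on_checksum ...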
Step 1: Server-Side Deduplication
When creating a blob, check if the current user already has one with that checksum.
Create a controller to handle deduplicated direct uploads:
# app/controllers/deduplicated_uploads_controller.rb
class DeduplicatedUploadsController < ActiveStorage::DirectUploadsController
  before_action :authenticate_user!

  def create
    existing_blob = find_existing_blob_for_user(
      blob_params[:checksum],
      blob_params[:byte_size]
    )

    if existing_blob
      render json: existing_blob_json(existing_blob)
    else
      super
    end
  end

  private

  def find_existing_blob_for_user(checksum, byte_size)
    return nil if checksum.blank? || byte_size.blank?

    # Only find blobs the current user has uploaded before.
    # Uses a single query with EXISTS subqueries; matching both
    # checksum AND byte_size raises the bar for collision attacks.
    ActiveStorage::Blob
      .joins(:attachments)
      .where(checksum: checksum, byte_size: byte_size)
      .where(user_owns_attachment_sql)
      .first
  end

  # SQL fragment that checks if current_user owns the attached record.
  # Uses EXISTS subqueries for efficiency (no loading IDs into memory).
  # Interpolating current_user.id is safe here: it's an integer primary
  # key from the database, never user-supplied text.
  def user_owns_attachment_sql
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
          AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{current_user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
          AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{current_user.id}))
      )
    SQL
  end

  def existing_blob_json(blob)
    {
      id: blob.id,
      key: blob.key,
      filename: blob.filename.to_s,
      content_type: blob.content_type,
      byte_size: blob.byte_size,
      checksum: blob.checksum,
      signed_id: blob.signed_id,
      direct_upload: nil # Signal to skip the upload
    }
  end

  def blob_params
    params.require(:blob).permit(:filename, :byte_size, :checksum, :content_type, metadata: {})
  end
end
This queries through the active_storage_attachments table to find blobs attached to records the current user owns. Adjust the record types and ownership checks in user_owns_attachment_sql (Document and Avatar here) to match your application’s ownership model.
Add the route:
# config/routes.rb
Rails.application.routes.draw do
  post '/rails/active_storage/direct_uploads',
       to: 'deduplicated_uploads#create',
       as: :deduplicated_direct_uploads
end
How the Attachment Works
When we find an existing blob, we return its signed_id. The client submits this signed_id with the form, and ActiveStorage creates a new attachment pointing to the existing blob:
# app/controllers/documents_controller.rb
class DocumentsController < ApplicationController
  def create
    @document = current_user.documents.build(document_params)

    if @document.save
      redirect_to @document
    else
      render :new, status: :unprocessable_entity
    end
  end

  private

  def document_params
    params.require(:document).permit(:title, :file)
  end
end
When params[:document][:file] is a signed_id string (not an uploaded file), ActiveStorage finds the blob and attaches it. One blob can have many attachments:
blob = ActiveStorage::Blob.find_by(checksum: "vckNNU4TN7zbzl+o3tjXPQ==")
blob.attachments.count
# => 5 (five different records share this blob)
This is the key to deduplication: multiple records reference the same stored file.
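You can watch it happen from the console — a sketch reusing the blob above and assuming a user with documents:

document = user.documents.create!(title: "Q3 report")
document.file.attach(blob.signed_id)

document.file.blob == blob
# => true — a new attachment row, but no new blob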
Step 2: Client-Side Upload Handling
The client needs to know when to skip the upload. Our controller signals this with direct_upload: null in the response. One catch: the DirectUpload class from @rails/activestorage assumes a direct_upload payload is always present (it destructures the upload URL and headers from it), so it throws on our null signal. Instead, we create the blob record with a plain fetch and only transfer the file body when the server asks for it:

// app/javascript/controllers/deduplicated_upload_controller.js
import { Controller } from "@hotwired/stimulus"
// FileChecksum isn't in the package's public exports; import the internal
// module (bundlers resolve this path; with importmap, pin it explicitly)
import { FileChecksum } from "@rails/activestorage/src/file_checksum"

export default class extends Controller {
  static targets = ["input", "progress"]
  static values = { url: String }

  async upload() {
    const file = this.inputTarget.files[0]
    if (!file) return

    try {
      const checksum = await this.computeChecksum(file)
      const blob = await this.createBlob(file, checksum)

      if (blob.direct_upload === null) {
        this.showMessage("File already exists - instant upload!")
      } else {
        await this.storeFile(file, blob.direct_upload)
        this.showMessage("Upload complete")
      }

      this.attachSignedId(blob.signed_id)
    } catch (error) {
      console.error(error)
      this.showMessage("Upload failed")
    }
  }

  computeChecksum(file) {
    return new Promise((resolve, reject) => {
      FileChecksum.create(file, (error, checksum) => {
        error ? reject(error) : resolve(checksum)
      })
    })
  }

  // POST the metadata; the server answers with an existing blob
  // (direct_upload: null) or a new one plus upload instructions
  async createBlob(file, checksum) {
    const response = await fetch(this.urlValue, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-CSRF-Token": document.querySelector('meta[name="csrf-token"]')?.content
      },
      credentials: "same-origin",
      body: JSON.stringify({
        blob: {
          filename: file.name,
          content_type: file.type,
          byte_size: file.size,
          checksum: checksum
        }
      })
    })
    if (!response.ok) throw new Error(`Blob creation failed: ${response.status}`)
    return response.json()
  }

  // PUT the file body to the storage service, as DirectUpload would.
  // (fetch has no upload-progress events; switch to XHR if you need a bar.)
  async storeFile(file, { url, headers }) {
    const response = await fetch(url, { method: "PUT", headers: headers, body: file })
    if (!response.ok) throw new Error(`Upload failed: ${response.status}`)
  }

  attachSignedId(signedId) {
    const input = document.createElement("input")
    input.type = "hidden"
    input.name = this.inputTarget.name
    input.value = signedId
    this.inputTarget.form.appendChild(input)
    // Disable the file input so the raw file isn't submitted alongside the signed_id
    this.inputTarget.disabled = true
  }

  showMessage(text) {
    this.progressTarget.textContent = text
  }
}
Use it in your form:
<%= form_with model: @document do |f| %>
  <div data-controller="deduplicated-upload"
       data-deduplicated-upload-url-value="<%= deduplicated_direct_uploads_url %>">
    <%= f.file_field :file,
          data: {
            deduplicated_upload_target: "input",
            action: "change->deduplicated-upload#upload"
          } %>
    <div data-deduplicated-upload-target="progress"></div>
  </div>
  <%= f.submit %>
<% end %>
Step 3: Skip the Network Request Entirely
We can go further: compute the checksum client-side and check whether the blob exists before attempting the upload at all.
First, extract the scoping logic into a concern so both controllers can share it:
# app/controllers/concerns/blob_scoping.rb
module BlobScoping
  extend ActiveSupport::Concern

  def find_user_blob(checksum, byte_size)
    return nil if checksum.blank? || byte_size.blank?

    ActiveStorage::Blob
      .joins(:attachments)
      .where(checksum: checksum, byte_size: byte_size)
      .where(user_owns_attachment_sql)
      .first
  end

  private

  def user_owns_attachment_sql
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
          AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{current_user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
          AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{current_user.id}))
      )
    SQL
  end
end
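With the concern in place, the Step 1 upload controller can include it and drop its private copy of the query — a sketch of the refactor:

# app/controllers/deduplicated_uploads_controller.rb
class DeduplicatedUploadsController < ActiveStorage::DirectUploadsController
  include BlobScoping
  before_action :authenticate_user!

  def create
    existing_blob = find_user_blob(blob_params[:checksum], blob_params[:byte_size])
    existing_blob ? render(json: existing_blob_json(existing_blob)) : super
  end

  # existing_blob_json and blob_params stay the same as before
end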
Then use it in the lookup controller:
# app/controllers/blob_lookups_controller.rb
class BlobLookupsController < ApplicationController
  include BlobScoping
  before_action :authenticate_user!

  def show
    # presence guards against a missing param silently becoming 0
    blob = find_user_blob(params[:checksum], params[:byte_size].presence&.to_i)

    if blob
      render json: { exists: true, signed_id: blob.signed_id }
    else
      render json: { exists: false }
    end
  end
end
# config/routes.rb
# Use query param for checksum (base64 contains +, /, = which need encoding in paths)
get '/blobs/lookup', to: 'blob_lookups#show', as: :blob_lookup
Query-string parameters arrive in params just like path segments, so the controller above needs no changes.
Now the JavaScript can check first:
// app/javascript/controllers/smart_upload_controller.js
import { Controller } from "@hotwired/stimulus"
import { DirectUpload } from "@rails/activestorage"
// FileChecksum isn't in the package's public exports; import the internal
// module (bundlers resolve this path; with importmap, pin it explicitly)
import { FileChecksum } from "@rails/activestorage/src/file_checksum"

export default class extends Controller {
  static targets = ["input", "status"]
  static values = {
    lookupUrl: String,
    uploadUrl: String
  }

  async upload() {
    const file = this.inputTarget.files[0]
    if (!file) return

    try {
      this.statusTarget.textContent = "Computing checksum..."
      const checksum = await this.computeChecksum(file)

      this.statusTarget.textContent = "Checking for duplicates..."
      const existing = await this.lookupBlob(checksum, file.size)

      if (existing.exists) {
        this.statusTarget.textContent = "File already uploaded - instant!"
        this.attachSignedId(existing.signed_id)
        return
      }

      this.statusTarget.textContent = "Uploading..."
      await this.performUpload(file)
    } catch (error) {
      this.statusTarget.textContent = `Error: ${error.message}`
      console.error("Upload failed:", error)
    }
  }

  computeChecksum(file) {
    return new Promise((resolve, reject) => {
      FileChecksum.create(file, (error, checksum) => {
        if (error) {
          reject(new Error(`Checksum failed: ${error}`))
        } else {
          resolve(checksum)
        }
      })
    })
  }

  async lookupBlob(checksum, byteSize) {
    const url = new URL(this.lookupUrlValue, window.location.origin)
    url.searchParams.set("checksum", checksum)
    url.searchParams.set("byte_size", byteSize)

    const response = await fetch(url, {
      headers: {
        "X-CSRF-Token": this.csrfToken,
        "Accept": "application/json"
      },
      credentials: "same-origin"
    })

    if (!response.ok) {
      throw new Error(`Lookup failed: ${response.status}`)
    }
    return response.json()
  }

  get csrfToken() {
    const meta = document.querySelector('meta[name="csrf-token"]')
    return meta ? meta.content : ""
  }

  attachSignedId(signedId) {
    const input = document.createElement("input")
    input.type = "hidden"
    input.name = this.inputTarget.name
    input.value = signedId
    this.inputTarget.form.appendChild(input)
    // Disable the file input so the raw file isn't submitted alongside the signed_id
    this.inputTarget.disabled = true
  }

  performUpload(file) {
    return new Promise((resolve, reject) => {
      const upload = new DirectUpload(file, this.uploadUrlValue, this)
      upload.create((error, blob) => {
        if (error) {
          reject(new Error(`Upload failed: ${error}`))
        } else {
          this.statusTarget.textContent = "Upload complete"
          this.attachSignedId(blob.signed_id)
          resolve(blob)
        }
      })
    })
  }

  // DirectUpload delegate callback for progress updates
  directUploadWillStoreFileWithXHR(request) {
    request.upload.addEventListener("progress", (event) => {
      const percent = Math.round((event.loaded / event.total) * 100)
      this.statusTarget.textContent = `Uploading: ${percent}%`
    })
  }
}
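Wire it into the form the same way as Step 2 — a sketch. Uploads can point at the standard rails_direct_uploads_url, since the lookup already handled deduplication (pointing DirectUpload at the deduplicated endpoint would hand it a null direct_upload if a race finds a duplicate):

<%= form_with model: @document do |f| %>
  <div data-controller="smart-upload"
       data-smart-upload-lookup-url-value="<%= blob_lookup_url %>"
       data-smart-upload-upload-url-value="<%= rails_direct_uploads_url %>">
    <%= f.file_field :file,
          data: {
            smart_upload_target: "input",
            action: "change->smart-upload#upload"
          } %>
    <div data-smart-upload-target="status"></div>
  </div>
  <%= f.submit %>
<% end %>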
Expanding Scope
User-scoped deduplication is the safest default, but you might want broader deduplication in some cases. Here are options for expanding scope:
Option A: Organization/tenant scope
For SaaS apps where teammates share files, deduplicate within the organization. This means matching blobs attached to records that belong to the organization:
def find_org_blob(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  ActiveStorage::Blob
    .joins(:attachments)
    .where(checksum: checksum, byte_size: byte_size)
    .where(org_owns_attachment_sql)
    .first
end

def org_owns_attachment_sql
  # Find blobs attached to records owned by the organization
  <<~SQL.squish
    (
      (active_storage_attachments.record_type = 'Document'
        AND EXISTS (
          SELECT 1 FROM documents
          WHERE documents.id = active_storage_attachments.record_id
          AND documents.organization_id = #{current_organization.id}
        ))
      OR
      (active_storage_attachments.record_type = 'Project'
        AND EXISTS (
          SELECT 1 FROM projects
          WHERE projects.id = active_storage_attachments.record_id
          AND projects.organization_id = #{current_organization.id}
        ))
    )
  SQL
end
Option B: Public files only
Allow global deduplication for files explicitly marked public:
def find_public_blob(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  # metadata is a serialized JSON *text* column, so cast it (PostgreSQL)
  # before using JSON operators
  ActiveStorage::Blob
    .where(checksum: checksum, byte_size: byte_size)
    .where("metadata::jsonb->>'public' = ?", "true")
    .first
end

def find_existing_blob(checksum, byte_size)
  find_user_blob(checksum, byte_size) || find_public_blob(checksum, byte_size)
end
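Nothing sets that flag automatically — blob metadata is free-form, so mark blobs public yourself at creation. A sketch:

blob = ActiveStorage::Blob.create_and_upload!(
  io: File.open("logo.png"),
  filename: "logo.png",
  metadata: { "public" => true }
)
blob.metadata["public"] # => true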
Option C: Content-type based (server-side only)
Deduplicate “safe” content like images globally, but keep documents user-scoped. This fits server-side deduplication (Step 1), where the blob-creation request carries a content_type; the pre-upload lookup (Step 3) doesn’t send one. Remember the content_type is client-asserted either way, so treat it as a hint rather than a guarantee:
# In DeduplicatedUploadsController
def find_existing_blob_for_user(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  content_type = blob_params[:content_type]

  if content_type&.start_with?("image/")
    # Images can be deduplicated globally (low risk)
    ActiveStorage::Blob.find_by(checksum: checksum, byte_size: byte_size)
  else
    # Documents stay user-scoped
    find_user_blob(checksum, byte_size)
  end
end
Handling Edge Cases
Race conditions: The same user uploads the same file twice in quick succession. Both lookups report “not found,” both files upload, and you end up with two identical blobs. That’s acceptable—you briefly store one redundant copy—and you can merge duplicates per-user with a background job:
# app/jobs/deduplicate_user_blobs_job.rb
class DeduplicateUserBlobsJob < ApplicationJob
  def perform(user)
    duplicates = find_duplicate_checksums_for(user)

    duplicates.each do |checksum|
      deduplicate_blobs_with_checksum(user, checksum)
    end
  end

  private

  def find_duplicate_checksums_for(user)
    ActiveStorage::Blob
      .joins(:attachments)
      .where(user_owns_attachment_sql(user))
      .group(:checksum)
      .having("COUNT(DISTINCT active_storage_blobs.id) > 1")
      .pluck(:checksum)
  end

  def deduplicate_blobs_with_checksum(user, checksum)
    ActiveStorage::Blob.transaction do
      # PostgreSQL rejects SELECT DISTINCT ... FOR UPDATE, so resolve the
      # ids first, then lock the plain rows
      blob_ids = ActiveStorage::Blob
        .joins(:attachments)
        .where(checksum: checksum)
        .where(user_owns_attachment_sql(user))
        .distinct
        .pluck(:id)

      blobs = ActiveStorage::Blob
        .where(id: blob_ids)
        .order(:created_at)
        .lock("FOR UPDATE")
        .to_a

      canonical = blobs.first
      next if canonical.nil?

      # Re-point attachments at the oldest blob, then purge the copies
      blobs.drop(1).each do |duplicate|
        duplicate.attachments.update_all(blob_id: canonical.id)
        duplicate.purge
      end
    end
  end

  def user_owns_attachment_sql(user)
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
          AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
          AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{user.id}))
      )
    SQL
  end
end
Orphaned blobs: The find_user_blob method already joins through attachments, so orphaned blobs (those with no attachments) are automatically excluded.
Different filenames: Same content, different names. The filename lives on the blob, so a deduplicated attachment reports the filename from the first upload. If per-record display names matter, store them on your own model; otherwise this is fine—dedupe on content, not metadata.
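A minimal sketch of that workaround, assuming a hypothetical original_filename column on Document:

class Document < ApplicationRecord
  has_one_attached :file

  # Prefer the name this user uploaded under, falling back to the blob's
  def display_filename
    self[:original_filename] || file.filename.to_s
  end
end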
Preventing Premature Blob Deletion
There’s a subtle hazard with shared blobs: by default, ActiveStorage enqueues a purge of the blob when a record is deleted. If Document A and Document B share a blob, deleting Document A triggers a purge aimed at the blob Document B still needs. In practice the foreign-key constraint on active_storage_attachments usually rescues the blob (purge silently no-ops on the constraint violation), but relying on that is fragile—races and databases without the constraint can still lose the file.
Make the intent explicit by disabling automatic purging on your models:
# app/models/document.rb
class Document < ApplicationRecord
  belongs_to :user
  has_one_attached :file, dependent: false # Don't auto-purge
end

# app/models/avatar.rb
class Avatar < ApplicationRecord
  belongs_to :user
  has_one_attached :image, dependent: false
end
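A quick console check of the new behavior — a sketch assuming an existing document:

document = user.documents.first
blob = document.file.blob

document.destroy
ActiveStorage::Blob.exists?(blob.id)
# => true — the attachment row is gone, but the blob survives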
Now blobs persist even when their attachments are deleted. Clean up orphaned blobs (those with zero attachments) with a scheduled job:
# app/jobs/cleanup_orphaned_blobs_job.rb
class CleanupOrphanedBlobsJob < ApplicationJob
  def perform
    # Find blobs with no attachments, older than 1 day (grace period)
    ActiveStorage::Blob
      .left_joins(:attachments)
      .where(active_storage_attachments: { id: nil })
      .where(active_storage_blobs: { created_at: ...1.day.ago })
      .find_each(&:purge)
  end
end
Schedule it to run daily. With Solid Queue (Rails 8 default):
# config/recurring.yml
production:
  cleanup_orphaned_blobs:
    class: CleanupOrphanedBlobsJob
    schedule: every day at 3am
Or with sidekiq-cron:
# config/initializers/sidekiq.rb
Sidekiq::Cron::Job.create(
  name: "Cleanup orphaned blobs - daily",
  cron: "0 3 * * *",
  class: "CleanupOrphanedBlobsJob"
)
The 1-day grace period prevents race conditions where a blob is created but not yet attached.
Measuring Impact
Track your deduplication rate:
# In your controller
def create
  existing_blob = find_user_blob(blob_params[:checksum], blob_params[:byte_size])

  if existing_blob
    Rails.logger.info "[Dedup] Reused blob #{existing_blob.id} for user #{current_user.id} (#{existing_blob.byte_size} bytes saved)"
    StatsD.increment("uploads.deduplicated")
    StatsD.increment("uploads.bytes_saved", existing_blob.byte_size)
    # ...
  end
end
Check duplication potential per user:
def duplication_stats_for(user)
  # Find duplicate checksums for this user's blobs
  duplicate_checksums = ActiveStorage::Blob
    .joins(:attachments)
    .where(user_owns_attachment_sql(user))
    .group(:checksum)
    .having("COUNT(DISTINCT active_storage_blobs.id) > 1")
    .pluck(:checksum)

  return { duplicates: 0, wasted_bytes: 0 } if duplicate_checksums.empty?

  # Calculate wasted space. Note: .distinct.sum(:byte_size) would emit
  # SUM(DISTINCT byte_size) and collapse equal sizes, so pluck rows instead
  rows = ActiveStorage::Blob
    .joins(:attachments)
    .where(checksum: duplicate_checksums)
    .where(user_owns_attachment_sql(user))
    .distinct
    .pluck(:id, :checksum, :byte_size)

  total_bytes = rows.sum { |_id, _checksum, size| size }
  unique_bytes = rows.uniq { |_id, checksum, _size| checksum }
                     .sum { |_id, _checksum, size| size }

  {
    duplicates: rows.size - duplicate_checksums.size,
    wasted_bytes: total_bytes - unique_bytes
  }
end

def user_owns_attachment_sql(user)
  <<~SQL.squish
    (
      (active_storage_attachments.record_type = 'Document'
        AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{user.id}))
      OR
      (active_storage_attachments.record_type = 'Avatar'
        AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{user.id}))
    )
  SQL
end
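Calling it from the console (numbers illustrative):

duplication_stats_for(User.find(42))
# => { duplicates: 3, wasted_bytes: 7340032 }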
Security Considerations
File deduplication introduces attack surfaces worth understanding. Here’s how to mitigate them:
Checksum collisions
ActiveStorage uses MD5 by default. While MD5 is fast, it’s cryptographically broken—identical-prefix collisions take seconds on commodity hardware, and chosen-prefix collisions are practical for a motivated attacker. We mitigate this by:
- Requiring both checksum AND byte_size to match (implemented throughout this tutorial), so a crafted duplicate must also match the target’s exact size.
- Scoping to the current user’s files. Even with a collision, you can only access your own blobs.
For stronger guarantees, Rails 8.2 adds support for SHA256 checksums via PR #54123. Configure it per-service:
# config/storage.yml
amazon:
  service: S3
  bucket: your-bucket
  checksum_algorithm: SHA256 # Instead of the default MD5
If you’re building a new app or in a FIPS-compliant environment, prefer SHA256.
Rate limiting lookups
The blob lookup endpoint could be abused to enumerate what files exist. Add rate limiting:
# app/controllers/blob_lookups_controller.rb
class BlobLookupsController < ApplicationController
  include BlobScoping
  before_action :authenticate_user!

  # Rails 8 built-in rate limiting (use Rack::Attack on older versions):
  # limit to 60 lookups per minute per user
  rate_limit to: 60, within: 1.minute, by: -> { current_user.id }

  def show
    blob = find_user_blob(params[:checksum], params[:byte_size].presence&.to_i)

    if blob
      render json: { exists: true, signed_id: blob.signed_id }
    else
      render json: { exists: false }
    end
  end
end
With Rails 8’s built-in rate limiting, this is straightforward. For older Rails versions, use Rack::Attack.
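For pre-8 apps, a roughly equivalent Rack::Attack throttle — a sketch that assumes Devise’s Warden is present in the Rack env:

# config/initializers/rack_attack.rb
class Rack::Attack
  throttle("blob_lookups/user", limit: 60, period: 1.minute) do |req|
    if req.path == "/blobs/lookup"
      # Discriminator: throttle per authenticated user; nil skips throttling
      req.env["warden"]&.user&.id
    end
  end
end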
Expanding scope carefully
The “Expanding Scope” options above progressively increase risk. User-scoped deduplication is safe by default. Organization scope is reasonable for trusted teams. Global deduplication (even for “safe” content types) should only be used when you fully understand the implications—an attacker who knows a file’s checksum and size could potentially confirm its existence or obtain a signed URL.
Wrapping Up
Deduplication saves storage costs and makes uploads feel instant for returning files. The key insight: ActiveStorage already computes checksums—we just need to use them.
By scoping to the current user and matching both checksum and byte_size, you get the storage savings without security risks. Users can only deduplicate against their own uploads, preventing them from claiming access to files they shouldn’t have.
Start with server-side deduplication in the controller. Add client-side lookup if you want to skip uploads entirely. Expand scope to organization or public files only when you have a clear use case and understand the security tradeoffs.
For new applications, consider configuring SHA256 checksums in Rails 8.2+ for stronger integrity guarantees.
Acknowledgements
Thanks to @juliknl for pointing out the MD5 collision risks and recommending the byte_size check.