Build a File Deduplication System with ActiveStorage

Every time a user uploads their company logo, profile picture, or that same PDF they’ve uploaded three times before, you’re paying to store it again. ActiveStorage doesn’t deduplicate by default—each upload creates a new blob, even if the content is identical.

Let’s fix that. We’ll build a deduplication system that detects identical files and reuses existing blobs, saving storage costs and making uploads instant for duplicates. We’ll scope deduplication to each user—so users only dedupe against their own uploads, keeping files secure.

How ActiveStorage Checksums Work

Every ActiveStorage blob has a checksum column: a base64-encoded MD5 digest of the file content. Two files with identical content will always have the same checksum:

ActiveStorage::Blob.pluck(:checksum).tally
# => {"vckNNU4TN7zbzl+o3tjXPQ==" => 47, "x8K9f2mVhLpWQ..." => 12, ...}

If you see counts greater than 1, you have duplicates. Let’s eliminate them.
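
To pull just the duplicated checksums (with their counts) straight from the database, a quick console sketch:

# Only checksums that appear on more than one blob
ActiveStorage::Blob.group(:checksum).having("COUNT(*) > 1").count
# => {"vckNNU4TN7zbzl+o3tjXPQ==" => 47, "x8K9f2mVhLpWQ..." => 12}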

Adding an Index for Performance

ActiveStorage doesn’t index the checksum column by default. Without an index, deduplication queries will do full table scans—fine for small apps, but slow at scale.

Note: This migration modifies ActiveStorage’s schema. Adding an index is low-risk (no changes to columns or data), but you’re still customizing an engine-owned table. Document this decision for your team.

class AddIndexToActiveStorageBlobsChecksum < ActiveRecord::Migration[8.0]
  disable_ddl_transaction!

  def change
    add_index :active_storage_blobs, :checksum, algorithm: :concurrently
  end
end

The algorithm: :concurrently option (PostgreSQL) builds the index without locking writes, which is essential for production databases with existing data. It requires disable_ddl_transaction! since concurrent index creation can’t run inside a transaction.
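
Once the index exists, you can sanity-check that lookups use it. The index name below is Rails’s default naming; yours may differ:

puts ActiveStorage::Blob
  .where(checksum: "vckNNU4TN7zbzl+o3tjXPQ==", byte_size: 1024)
  .explain
# Look for an Index Scan using index_active_storage_blobs_on_checksum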

Step 1: Server-Side Deduplication

When creating a blob, check if the current user already has one with that checksum.

Create a controller to handle deduplicated direct uploads:

# app/controllers/deduplicated_uploads_controller.rb
class DeduplicatedUploadsController < ActiveStorage::DirectUploadsController
  before_action :authenticate_user!

  def create
    existing_blob = find_existing_blob_for_user(
      blob_params[:checksum],
      blob_params[:byte_size]
    )

    if existing_blob
      render json: existing_blob_json(existing_blob)
    else
      super
    end
  end

  private

  def find_existing_blob_for_user(checksum, byte_size)
    return nil if checksum.blank? || byte_size.blank?

    # Only find blobs the current user has uploaded before
    # Uses a single efficient query with EXISTS subqueries
    # Matching both checksum AND byte_size prevents collision attacks
    ActiveStorage::Blob
      .joins(:attachments)
      .where(checksum: checksum, byte_size: byte_size)
      .where(user_owns_attachment_sql)
      .first
  end

  # SQL fragment that checks if current_user owns the attached record
  # Uses EXISTS subqueries for efficiency (no loading IDs into memory)
  def user_owns_attachment_sql
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
         AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{current_user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
         AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{current_user.id}))
      )
    SQL
  end

  def existing_blob_json(blob)
    {
      id: blob.id,
      key: blob.key,
      filename: blob.filename.to_s,
      content_type: blob.content_type,
      byte_size: blob.byte_size,
      checksum: blob.checksum,
      signed_id: blob.signed_id,
      direct_upload: nil  # Signal to skip upload
    }
  end

  def blob_params
    params.require(:blob).permit(:filename, :byte_size, :checksum, :content_type, metadata: {})
  end
end

This queries through the active_storage_attachments table to find blobs attached to records the current user owns. Adjust the record types and EXISTS subqueries in user_owns_attachment_sql to match your application’s ownership model. Interpolating current_user.id into the SQL is safe here because it’s an integer primary key; use bind parameters for anything user-supplied.
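
If you support more than a couple of attachable types, hand-writing each branch gets tedious. Here is a sketch that builds the same fragment from a map (OWNED_TYPES is this tutorial’s invention, not an ActiveStorage concept):

# Hypothetical map: record_type => [table name, owner column]
OWNED_TYPES = {
  "Document" => ["documents", "user_id"],
  "Avatar"   => ["avatars",   "user_id"]
}.freeze

def user_owns_attachment_sql
  clauses = OWNED_TYPES.map do |record_type, (table, column)|
    "(active_storage_attachments.record_type = '#{record_type}' " \
    "AND EXISTS (SELECT 1 FROM #{table} " \
    "WHERE #{table}.id = active_storage_attachments.record_id " \
    "AND #{table}.#{column} = #{current_user.id}))"
  end

  "(#{clauses.join(' OR ')})"
end

Every fragment comes from the hardcoded map plus an integer id, so nothing user-supplied reaches the SQL.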

Add the route. Application routes take precedence over engine routes, so this overrides ActiveStorage’s default direct uploads endpoint:

# config/routes.rb
Rails.application.routes.draw do
  post '/rails/active_storage/direct_uploads',
       to: 'deduplicated_uploads#create',
       as: :deduplicated_direct_uploads
end

How the Attachment Works

When we find an existing blob, we return its signed_id. The client submits this signed_id with the form, and ActiveStorage creates a new attachment pointing to the existing blob:

# app/controllers/documents_controller.rb
class DocumentsController < ApplicationController
  def create
    @document = current_user.documents.build(document_params)

    if @document.save
      redirect_to @document
    else
      render :new
    end
  end

  private

  def document_params
    params.require(:document).permit(:title, :file)
  end
end

When params[:document][:file] is a signed_id string (not an uploaded file), ActiveStorage finds the blob and attaches it. One blob can have many attachments:

blob = ActiveStorage::Blob.find_by(checksum: "vckNNU4TN7zbzl+o3tjXPQ==")
blob.attachments.count
# => 5  (five different records share this blob)

This is the key to deduplication: multiple records reference the same stored file.
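
Attaching by signed id is roughly equivalent to resolving the blob yourself, as this console sketch shows:

blob = ActiveStorage::Blob.find_signed!(signed_id)
document.file.attach(blob)  # new attachment row, same blob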

Step 2: Client-Side Upload Handling

The client needs to know when to skip the upload. When direct_upload is null in the response, the file already exists. One caveat: the stock DirectUpload class builds its storage request from the direct_upload data, so depending on your @rails/activestorage version a null value may raise before your success callback runs. Test this flow against your version, or use the pre-upload lookup in Step 3, which never invokes DirectUpload for duplicates:

// app/javascript/controllers/deduplicated_upload_controller.js
import { Controller } from "@hotwired/stimulus"
import { DirectUpload } from "@rails/activestorage"

export default class extends Controller {
  static targets = ["input", "progress"]
  static values = { url: String }

  upload() {
    const file = this.inputTarget.files[0]
    if (!file) return

    const upload = new DirectUpload(file, this.urlValue, this)

    upload.create((error, blob) => {
      if (error) {
        console.error(error)
      } else {
        this.handleSuccess(blob)
      }
    })
  }

  handleSuccess(blob) {
    // Create hidden input with signed_id
    const input = document.createElement("input")
    input.type = "hidden"
    input.name = this.inputTarget.name
    input.value = blob.signed_id
    this.inputTarget.form.appendChild(input)

    // Disable the file input so the raw bytes aren't submitted again
    this.inputTarget.disabled = true

    if (blob.direct_upload === null) {
      this.showMessage("File already exists - instant upload!")
    } else {
      this.showMessage("Upload complete")
    }
  }

  // DirectUpload delegate methods
  directUploadWillStoreFileWithXHR(request) {
    request.upload.addEventListener("progress", (event) => {
      const progress = (event.loaded / event.total) * 100
      this.progressTarget.style.width = `${progress}%`
    })
  }

  showMessage(text) {
    // Update UI to show status
    this.progressTarget.textContent = text
  }
}

Use it in your form:

<%= form_with model: @document do |f| %>
  <div data-controller="deduplicated-upload"
       data-deduplicated-upload-url-value="<%= deduplicated_direct_uploads_url %>">

    <%= f.file_field :file,
        data: {
          deduplicated_upload_target: "input",
          action: "change->deduplicated-upload#upload"
        } %>

    <div data-deduplicated-upload-target="progress"></div>
  </div>

  <%= f.submit %>
<% end %>

Step 3: Skip the Network Request Entirely

We can go further. Compute the checksum client-side and check if the blob exists before attempting upload:

Extract the scoping logic into a concern to share between controllers:

# app/controllers/concerns/blob_scoping.rb
module BlobScoping
  extend ActiveSupport::Concern

  def find_user_blob(checksum, byte_size)
    return nil if checksum.blank? || byte_size.blank?

    ActiveStorage::Blob
      .joins(:attachments)
      .where(checksum: checksum, byte_size: byte_size)
      .where(user_owns_attachment_sql)
      .first
  end

  private

  def user_owns_attachment_sql
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
         AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{current_user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
         AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{current_user.id}))
      )
    SQL
  end
end

Then use it in the lookup controller (and include it in DeduplicatedUploadsController in place of its private helpers, so the scoping logic lives in one place):

# app/controllers/blob_lookups_controller.rb
class BlobLookupsController < ApplicationController
  include BlobScoping
  before_action :authenticate_user!

  def show
    # presence guard: a missing byte_size param would otherwise become 0, not nil
    blob = find_user_blob(params[:checksum], params[:byte_size].presence&.to_i)

    if blob
      render json: { exists: true, signed_id: blob.signed_id }
    else
      render json: { exists: false }
    end
  end
end

# config/routes.rb
# Use query param for checksum (base64 contains +, /, = which need encoding in paths)
get '/blobs/lookup', to: 'blob_lookups#show', as: :blob_lookup

Now the JavaScript can check first:

// app/javascript/controllers/smart_upload_controller.js
import { Controller } from "@hotwired/stimulus"
// FileChecksum isn't exported from the package root in every version;
// if this import fails, pull it from "@rails/activestorage/src/file_checksum"
import { DirectUpload, FileChecksum } from "@rails/activestorage"

export default class extends Controller {
  static targets = ["input", "status"]
  static values = {
    lookupUrl: String,
    uploadUrl: String
  }

  async upload() {
    const file = this.inputTarget.files[0]
    if (!file) return

    try {
      this.statusTarget.textContent = "Computing checksum..."
      const checksum = await this.computeChecksum(file)

      this.statusTarget.textContent = "Checking for duplicates..."
      const existing = await this.lookupBlob(checksum, file.size)

      if (existing.exists) {
        this.statusTarget.textContent = "File already uploaded - instant!"
        this.attachSignedId(existing.signed_id)
        return
      }

      this.statusTarget.textContent = "Uploading..."
      await this.performUpload(file)
    } catch (error) {
      this.statusTarget.textContent = `Error: ${error.message}`
      console.error("Upload failed:", error)
    }
  }

  computeChecksum(file) {
    return new Promise((resolve, reject) => {
      FileChecksum.create(file, (error, checksum) => {
        if (error) {
          reject(new Error(`Checksum failed: ${error}`))
        } else {
          resolve(checksum)
        }
      })
    })
  }

  async lookupBlob(checksum, byteSize) {
    const url = new URL(this.lookupUrlValue, window.location.origin)
    url.searchParams.set("checksum", checksum)
    url.searchParams.set("byte_size", byteSize)

    const response = await fetch(url, {
      headers: {
        "X-CSRF-Token": this.csrfToken,
        "Accept": "application/json"
      },
      credentials: "same-origin"
    })

    if (!response.ok) {
      throw new Error(`Lookup failed: ${response.status}`)
    }

    return response.json()
  }

  get csrfToken() {
    const meta = document.querySelector('meta[name="csrf-token"]')
    return meta ? meta.content : ""
  }

  attachSignedId(signedId) {
    const input = document.createElement("input")
    input.type = "hidden"
    input.name = this.inputTarget.name
    input.value = signedId
    this.inputTarget.form.appendChild(input)

    // Disable the file input so the raw bytes aren't submitted again
    this.inputTarget.disabled = true
  }

  performUpload(file) {
    return new Promise((resolve, reject) => {
      const upload = new DirectUpload(file, this.uploadUrlValue, this)
      upload.create((error, blob) => {
        if (error) {
          reject(new Error(`Upload failed: ${error}`))
        } else {
          this.statusTarget.textContent = "Upload complete"
          this.attachSignedId(blob.signed_id)
          resolve(blob)
        }
      })
    })
  }

  directUploadWillStoreFileWithXHR(request) {
    request.upload.addEventListener("progress", (event) => {
      const percent = Math.round((event.loaded / event.total) * 100)
      this.statusTarget.textContent = `Uploading: ${percent}%`
    })
  }
}
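
Wire it up like the Step 2 form, passing both URLs as Stimulus values. The data attributes below follow Stimulus naming conventions for the smart_upload controller above, and the URL helpers come from the routes we defined earlier:

<%= form_with model: @document do |f| %>
  <div data-controller="smart-upload"
       data-smart-upload-lookup-url-value="<%= blob_lookup_url %>"
       data-smart-upload-upload-url-value="<%= deduplicated_direct_uploads_url %>">

    <%= f.file_field :file,
        data: {
          smart_upload_target: "input",
          action: "change->smart-upload#upload"
        } %>

    <div data-smart-upload-target="status"></div>
  </div>

  <%= f.submit %>
<% end %>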

Expanding Scope

User-scoped deduplication is the safest default, but you might want broader deduplication in some cases. Here are options for expanding scope:

Option A: Organization/tenant scope

For SaaS apps where teammates share files, deduplicate within the organization. This requires querying records owned by any organization member:

def find_org_blob(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  ActiveStorage::Blob
    .joins(:attachments)
    .where(checksum: checksum, byte_size: byte_size)
    .where(org_owns_attachment_sql)
    .first
end

def org_owns_attachment_sql
  # Find blobs attached to records owned by any org member
  <<~SQL.squish
    (
      (active_storage_attachments.record_type = 'Document'
       AND EXISTS (
         SELECT 1 FROM documents
         WHERE documents.id = active_storage_attachments.record_id
         AND documents.organization_id = #{current_organization.id}
       ))
      OR
      (active_storage_attachments.record_type = 'Project'
       AND EXISTS (
         SELECT 1 FROM projects
         WHERE projects.id = active_storage_attachments.record_id
         AND projects.organization_id = #{current_organization.id}
       ))
    )
  SQL
end

Option B: Public files only

Allow global deduplication for files explicitly marked public:

def find_public_blob(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  # Blob metadata is a serialized JSON *text* column, so cast it
  # before using the ->> operator (PostgreSQL)
  ActiveStorage::Blob
    .where(checksum: checksum, byte_size: byte_size)
    .where("(metadata::jsonb ->> 'public') = ?", "true")
    .first
end

def find_existing_blob(checksum, byte_size)
  find_user_blob(checksum, byte_size) || find_public_blob(checksum, byte_size)
end
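
Nothing marks blobs public by default; that decision is yours. Blob metadata is a serialized hash, so flagging one is a one-liner (a console sketch):

# Mark a blob as publicly deduplicable
blob.update!(metadata: blob.metadata.merge("public" => true))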

Option C: Content-type based (server-side only)

Deduplicate “safe” content like images globally, but keep documents user-scoped. This only works for server-side deduplication (Step 1) where we have the content_type from the upload request—it won’t work with the pre-upload lookup (Step 3):

# In DeduplicatedUploadsController
def find_existing_blob_for_user(checksum, byte_size)
  return nil if checksum.blank? || byte_size.blank?

  content_type = blob_params[:content_type]

  if content_type&.start_with?("image/")
    # Images can be deduplicated globally (low risk)
    ActiveStorage::Blob.find_by(checksum: checksum, byte_size: byte_size)
  else
    # Documents stay user-scoped
    find_user_blob(checksum, byte_size)
  end
end

Handling Edge Cases

Race conditions: The same user uploads a file twice in quick succession. Both lookups come back as “not found,” and both files upload. You end up with two blobs. That’s acceptable: future lookups will still match one of them, and you can merge the stragglers per user with a background job:

# app/jobs/deduplicate_user_blobs_job.rb
class DeduplicateUserBlobsJob < ApplicationJob
  def perform(user)
    duplicates = find_duplicate_checksums_for(user)

    duplicates.each do |checksum|
      deduplicate_blobs_with_checksum(user, checksum)
    end
  end

  private

  def find_duplicate_checksums_for(user)
    ActiveStorage::Blob
      .joins(:attachments)
      .where(user_owns_attachment_sql(user))
      .group(:checksum)
      .having("COUNT(DISTINCT active_storage_blobs.id) > 1")
      .pluck(:checksum)
  end

  def deduplicate_blobs_with_checksum(user, checksum)
    ActiveStorage::Blob.transaction do
      # PostgreSQL rejects SELECT DISTINCT ... FOR UPDATE, so resolve
      # the ids first, then lock the blob rows by id
      blob_ids = ActiveStorage::Blob
        .joins(:attachments)
        .where(checksum: checksum)
        .where(user_owns_attachment_sql(user))
        .distinct
        .pluck(:id)

      blobs = ActiveStorage::Blob
        .where(id: blob_ids)
        .order(:created_at)
        .lock("FOR UPDATE")
        .to_a

      canonical = blobs.first
      next if canonical.nil?

      blobs.drop(1).each do |duplicate|
        duplicate.attachments.update_all(blob_id: canonical.id)
        duplicate.purge
      end
    end
  end

  def user_owns_attachment_sql(user)
    <<~SQL.squish
      (
        (active_storage_attachments.record_type = 'Document'
         AND EXISTS (SELECT 1 FROM documents WHERE documents.id = active_storage_attachments.record_id AND documents.user_id = #{user.id}))
        OR
        (active_storage_attachments.record_type = 'Avatar'
         AND EXISTS (SELECT 1 FROM avatars WHERE avatars.id = active_storage_attachments.record_id AND avatars.user_id = #{user.id}))
      )
    SQL
  end
end
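
Enqueue it per user whenever duplicates are likely to have accumulated, for example DeduplicateUserBlobsJob.perform_later(user) after a bulk import.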

Orphaned blobs: The find_user_blob method already joins through attachments, so orphaned blobs (those with no attachments) are automatically excluded.
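
To see how many truly orphaned blobs you’re carrying, a console sketch:

# Blobs with no attachments at all
ActiveStorage::Blob
  .left_joins(:attachments)
  .where(active_storage_attachments: { id: nil })
  .count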

Different filenames: Same content, different names. The blob keeps the filename from whichever upload created it, and attachments don’t store their own, so a deduplicated re-upload will display the first file’s name. Dedupe on content, not metadata, and store a per-record display name if filenames matter in your UI.
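
If display names matter, capture one on your own model at attach time. A sketch, where display_filename is a hypothetical string column on documents:

# app/models/document.rb
class Document < ApplicationRecord
  belongs_to :user
  has_one_attached :file

  # display_filename is a hypothetical column; the blob keeps only
  # the name from the first upload
  before_save :capture_display_filename

  private

  def capture_display_filename
    self.display_filename ||= file.filename.to_s if file.attached?
  end
end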

Preventing Premature Blob Deletion

There’s a critical issue with shared blobs: has_one_attached defaults to dependent: :purge_later, so destroying a record enqueues a purge of its blob. If Document A and Document B share a blob, deleting Document A would delete the blob, breaking Document B.

Fix this by disabling automatic purging on your models:

# app/models/document.rb
class Document < ApplicationRecord
  belongs_to :user
  has_one_attached :file, dependent: false  # Don't auto-purge
end

# app/models/avatar.rb
class Avatar < ApplicationRecord
  belongs_to :user
  has_one_attached :image, dependent: false
end

Now blobs persist even when their attachments are deleted. Clean up orphaned blobs (those with zero attachments) with a scheduled job:

# app/jobs/cleanup_orphaned_blobs_job.rb
class CleanupOrphanedBlobsJob < ApplicationJob
  def perform
    # Find blobs with no attachments, older than 1 day (grace period)
    ActiveStorage::Blob
      .left_joins(:attachments)
      .where(active_storage_attachments: { id: nil })
      .where(active_storage_blobs: { created_at: ...1.day.ago })
      .find_each(&:purge)
  end
end

Schedule it to run daily. With Solid Queue (Rails 8 default):

# config/recurring.yml (keyed by environment, like database.yml)
production:
  cleanup_orphaned_blobs:
    class: CleanupOrphanedBlobsJob
    schedule: every day at 3am

Or with sidekiq-cron:

# config/initializers/sidekiq.rb
Sidekiq::Cron::Job.create(
  name: "Cleanup orphaned blobs - daily",
  cron: "0 3 * * *",
  class: "CleanupOrphanedBlobsJob"
)

The 1-day grace period prevents race conditions where a blob is created but not yet attached.

Measuring Impact

Track your deduplication rate:

# In your controller
def create
  existing_blob = find_user_blob(blob_params[:checksum], blob_params[:byte_size])

  if existing_blob
    Rails.logger.info "[Dedup] Reused blob #{existing_blob.id} for user #{current_user.id} (#{existing_blob.byte_size} bytes saved)"
    StatsD.increment("uploads.deduplicated")
    # statsd-instrument counts via increment with an explicit value
    StatsD.increment("uploads.bytes_saved", existing_blob.byte_size)
    # ...
  end
end

Check duplication potential per user:

def duplication_stats_for(user)
  # Find duplicate checksums for this user's blobs
  duplicate_checksums = ActiveStorage::Blob
    .joins(:attachments)
    .where(user_owns_attachment_sql(user))
    .group(:checksum)
    .having("COUNT(DISTINCT active_storage_blobs.id) > 1")
    .pluck(:checksum)

  return { duplicates: 0, wasted_bytes: 0 } if duplicate_checksums.empty?

  # Calculate wasted space. DISTINCT over (id, checksum, byte_size)
  # collapses the row-per-attachment fanout back to one row per blob.
  blob_rows = ActiveStorage::Blob
    .joins(:attachments)
    .where(checksum: duplicate_checksums)
    .where(user_owns_attachment_sql(user))
    .distinct
    .pluck(:id, :checksum, :byte_size)

  total_bytes  = blob_rows.sum { |_id, _checksum, size| size }
  unique_bytes = blob_rows.uniq { |_id, checksum, _size| checksum }
                          .sum { |_id, _checksum, size| size }

  {
    duplicates: blob_rows.size - duplicate_checksums.size,
    wasted_bytes: total_bytes - unique_bytes
  }
end

Here user_owns_attachment_sql(user) is the same ownership fragment used in DeduplicateUserBlobsJob; share it via a mixin rather than copying it again.

Security Considerations

File deduplication introduces attack surfaces worth understanding. Here’s how to mitigate them:

Checksum collisions

ActiveStorage uses MD5 by default. MD5 is cryptographically broken: identical-prefix collisions are trivial to generate today, and chosen-prefix collisions are within reach of a motivated attacker. We mitigate this by:

  1. Requiring both checksum AND byte_size to match (implemented throughout this tutorial). An attacker would need to create a collision with the exact same file size.
  2. Scoping to the current user’s files. Even with a collision, you can only access your own blobs.
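
For reference, the stored checksum is just a base64-encoded MD5 digest; you can reproduce it from a local file in the console:

require "digest"

checksum = Digest::MD5.file("logo.png").base64digest
ActiveStorage::Blob.exists?(checksum: checksum)
# => true if this exact content is already stored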

For stronger guarantees, Rails 8.2 adds support for SHA256 checksums via PR #54123. Configure it per-service:

# config/storage.yml
amazon:
  service: S3
  bucket: your-bucket
  checksum_algorithm: SHA256  # Instead of default MD5

If you’re building a new app or in a FIPS-compliant environment, prefer SHA256.

Rate limiting lookups

The blob lookup endpoint could be abused to enumerate what files exist. Add rate limiting:

# app/controllers/blob_lookups_controller.rb
class BlobLookupsController < ApplicationController
  include BlobScoping
  before_action :authenticate_user!

  # Rails 7.2+ built-in rate limiting (use Rack::Attack on older versions)
  # Limit to 60 lookups per minute per user
  rate_limit to: 60, within: 1.minute, by: -> { current_user.id }

  def show
    blob = find_user_blob(params[:checksum], params[:byte_size].presence&.to_i)

    if blob
      render json: { exists: true, signed_id: blob.signed_id }
    else
      render json: { exists: false }
    end
  end
end

The rate_limit macro ships with Rails 7.2 and later, so this is built in. For older Rails versions, use Rack::Attack.

Expanding scope carefully

The “Expanding Scope” options above progressively increase risk. User-scoped deduplication is safe by default. Organization scope is reasonable for trusted teams. Global deduplication (even for “safe” content types) should only be used when you fully understand the implications: an attacker who knows a file’s checksum and size could confirm that it exists, or obtain a signed id that lets them read it.

Wrapping Up

Deduplication saves storage costs and makes uploads feel instant for returning files. The key insight: ActiveStorage already computes checksums—we just need to use them.

By scoping to the current user and matching both checksum and byte_size, you get the storage savings without security risks. Users can only deduplicate against their own uploads, preventing them from claiming access to files they shouldn’t have.

Start with server-side deduplication in the controller. Add client-side lookup if you want to skip uploads entirely. Expand scope to organization or public files only when you have a clear use case and understand the security tradeoffs.

For new applications, consider configuring SHA256 checksums in Rails 8.2+ for stronger integrity guarantees.

Acknowledgements

Thanks to @juliknl (blog) for pointing out the MD5 collision risks and recommending the byte_size check.