Supavisor pooler and Storage connectivity issues in US-East-1
Incident Report for Supabase
Postmortem

Incident Summary

Date: January 29, 2025
Service Impacted: Supavisor – Postgres connection pooler
Affected Region: us-east-1
Duration: ~1 hour

Background

Supavisor is the connection pooler that manages Postgres connections. The team scheduled a major upgrade to Supavisor 2.0 across 10 regions. The new version, which improves query latency and availability zone awareness, had already been deployed successfully to 9 smaller regions over the preceding weeks.

Timeline of Events

  • 13:00 UTC – Maintenance began, and Supavisor 2.0 was successfully deployed to 9 regions with minimal downtime, requiring only client connection restarts.
  • 14:46 UTC – Deployment to us-east-1 was completed successfully with clients reconnected and throughput stable.
  • 16:05 UTC – A majority of clients disconnected and failed to reconnect reliably.
  • 16:11 UTC – Incident was declared, and rollback initiated.
  • 16:53 UTC – Rollback completed, restoring stability.
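
The intervals implied by the timeline can be worked out directly from its timestamps. The short sketch below does that arithmetic; the dictionary keys and the helper name are illustrative labels, not terms from the incident tooling.

```python
from datetime import datetime

# Timestamps (UTC, January 29, 2025) copied from the timeline above.
# The key names are illustrative labels chosen for this sketch.
TIMELINE = {
    "deploy_completed": "14:46",
    "clients_disconnected": "16:05",
    "incident_declared": "16:11",
    "rollback_completed": "16:53",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# How long us-east-1 ran on the new version before clients dropped.
stable_after_deploy = minutes_between(
    TIMELINE["deploy_completed"], TIMELINE["clients_disconnected"]
)
# Time from the first disconnections to the incident declaration.
time_to_declare = minutes_between(
    TIMELINE["clients_disconnected"], TIMELINE["incident_declared"]
)
# Time the rollback itself took.
rollback_duration = minutes_between(
    TIMELINE["incident_declared"], TIMELINE["rollback_completed"]
)
# Time from the first disconnections until stability was restored.
client_impact = minutes_between(
    TIMELINE["clients_disconnected"], TIMELINE["rollback_completed"]
)

print(stable_after_deploy, time_to_declare, rollback_duration, client_impact)
# prints: 79 6 42 48
```

Read this way, the rollback accounts for 42 of the 48 minutes between the first disconnections and recovery.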

Root Cause Analysis

Primary Cause

The failure stemmed from corruption of an ETS (Erlang Term Storage) table, which affected critical components in the following sequence:

  • 16:05:50 UTC – An :ets.select_delete operation failed, causing corruption.
  • 16:05:56 UTC – ETS-backed process registry lookups in the Syn and Cachex libraries began failing, which prevented tenants from reconnecting.
  • 16:06:00 UTC – Processes managed by the affected process registry crashed, dropping connections for ~80% of tenants.

Resolution

  • A rollback to the previous Supavisor version was initiated at 16:11 UTC.
  • The rollback completed at 16:53 UTC, and the previous stable version restored normal operation.

Action Items

  • Faster rollbacks – Reduce rollback time from roughly 40 minutes (42 minutes in this incident, 16:11 to 16:53 UTC) to under 90 seconds.
  • Improved alerting – Detect client disconnections within 2 minutes.
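
As an illustration of the alerting item, one common approach is to count disconnections in a sliding time window and alert when the count crosses a threshold. The sketch below is a minimal version of that idea and does not describe Supabase's actual alerting. The class name, method name, and threshold are hypothetical; only the 2-minute window comes from the action item.

```python
from collections import deque


class DisconnectAlert:
    """Hypothetical sliding-window detector for client disconnections."""

    def __init__(self, window_seconds: float = 120.0, threshold: int = 100):
        # 120 seconds mirrors the 2-minute detection target.
        # The threshold is an assumed value and would need tuning.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps of recent disconnects, oldest first

    def record_disconnect(self, timestamp: float) -> bool:
        """Record one disconnect; return True if the alert should fire."""
        self._events.append(timestamp)
        # Discard disconnects that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Usage: with a threshold of 3, isolated disconnects do not fire the alert,
# but three disconnects inside the same 2-minute window do.
alert = DisconnectAlert(window_seconds=120, threshold=3)
results = [alert.record_disconnect(t) for t in (0, 30, 200, 210, 220)]
print(results)
# prints: [False, False, False, False, True]
```

A fixed threshold is the simplest choice. A production detector would more likely compare the disconnect rate against a per-region baseline, since normal churn differs between small and large regions.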

Conclusion

The incident highlighted weaknesses in rollback speed and in resiliency to unforeseen failures under load. Future improvements will focus on faster rollbacks and earlier detection of client disconnections to reduce the impact of similar deployment failures.

Posted Feb 03, 2025 - 22:02 UTC

Resolved
This incident has been resolved.
Posted Jan 29, 2025 - 17:55 UTC

Monitoring
A failed Supavisor deploy has been rolled back.
Posted Jan 29, 2025 - 17:07 UTC

Identified
The issue has been identified and a fix is being implemented.
Posted Jan 29, 2025 - 16:25 UTC

Investigating
We are currently investigating this issue. For database connections, you can fall back to direct connections for the time being. Storage connectivity will be restored once the connection pooler issues are resolved.
Posted Jan 29, 2025 - 16:13 UTC

This incident affected: Connection Pooler and Storage.