Date: January 29, 2025
Service Impacted: Supavisor – Postgres connection pooler
Affected Region: us-east-1
Duration: ~1 hour
Supavisor is responsible for managing Postgres connections. The team scheduled a major upgrade to Supavisor 2.0 across 10 regions. The new version had already been successfully deployed in 9 smaller regions over the past few weeks, offering improvements in query latency and availability zone awareness.
The us-east-1 upgrade was completed successfully, with clients reconnected and throughput stable. The failure stemmed from an ETS (Erlang Term Storage) table corruption that affected critical components: an :ets.select_delete operation failed, causing the corruption; the Syn and Cachex libraries then began failing and prevented tenants from reconnecting.
The incident highlighted weaknesses in rollback speed and in resiliency to unforeseen failures under load. Future improvements will focus on rollback efficiency to reduce the impact of similar deployment failures.