Partial Outage across multiple systems
Updates
- We implemented a resolution and are monitoring the situation.
- We are continuing to experience issues across several systems and continue to investigate the root cause.
- We are currently experiencing performance degradation and intermittent availability across several systems. Some features may be slow or unavailable. Our team is investigating the root cause.
Incident Summary
On 15 Dec 2025, starting at approximately 8:20 CET, we observed that our Device Manager API was unable to handle requests. We traced this back to our messaging system, which had run out of storage space. As a result, dependent services could not push new messages to the system and became blocked. This began to impact further services, including the web application, the mobile app, asynchronous jobs, payments and webhooks. The incident was resolved at approximately 12:20 CET, after we reclaimed storage in our messaging system.
Impact on Users
This incident caused severe service degradation for all users during recurring periods of 5 to 15 minutes. Users were intermittently unable to access the web application. Additionally, payment services via self-service, the mobile app and the Open API were impacted, and devices did not work as expected. We do not expect any data loss.
Our Response
- 15 Dec 2025 10:55: Multiple services are experiencing repeated restart issues and are partially unavailable.
- 15 Dec 2025 11:16: We suspect our messaging system to be the root cause of these issues and are currently restarting it.
- 15 Dec 2025 11:40: The messaging system has been restarted and messages are being processed again.
- 15 Dec 2025 12:00: The messaging system was observed to have insufficient storage space. We contacted our messaging system's service provider to help us determine the cause of the storage shortage.
- 15 Dec 2025 12:24: To resolve the disk space issue in our messaging system, we manually deleted and re-created a queue. During this time, the Device Manager had a downtime of approximately 10 minutes. It is now operating as usual again.
Resolution
The storage was reclaimed by deleting and re-creating a message queue, which allowed our messaging system to resume normal operations. Shortly afterwards, the dependent services recovered as well.
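
For illustration only, here is a minimal sketch of the kind of queue re-creation described above, assuming a RabbitMQ-style broker accessed through the pika Python client; the report does not name the messaging system, so the broker URL and queue name below are placeholders.

# Hypothetical sketch: delete and re-create a queue so the broker can release
# the disk space held by its backlog. Broker URL and queue name are placeholders.
import pika

BROKER_URL = "amqp://guest:guest@localhost:5672/%2F"  # placeholder, not from the report
QUEUE_NAME = "device-manager-events"                  # placeholder, not from the report

connection = pika.BlockingConnection(pika.URLParameters(BROKER_URL))
channel = connection.channel()

# Deleting the queue discards its pending messages and frees the storage backing them.
channel.queue_delete(queue=QUEUE_NAME)

# Re-declare the queue under the same name so publishers and consumers can resume.
channel.queue_declare(queue=QUEUE_NAME, durable=True)

connection.close()

Because deleting a queue discards whatever is still in it, a step like this belongs in a runbook with clear preconditions rather than being run ad hoc.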
Lessons Learned
The threshold at which we are alerted about insufficient storage space in the messaging system will be lowered, so that we are warned earlier. Additionally, a runbook will be created containing the necessary steps to reclaim disk space, allowing us to react much faster next time, before any services are impacted. Furthermore, dependent services will be made more resilient to the messaging system being unavailable.
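
As a concrete illustration of the first point, a minimal sketch of an earlier-firing storage check, assuming the broker's data directory is visible to a monitoring script; the path and threshold below are placeholders, not values from the report.

# Hypothetical sketch: warn while there is still headroom, instead of alerting
# only once the broker has run out of space. Path and threshold are placeholders.
import shutil

DATA_DIR = "/var/lib/messaging"   # placeholder for the broker's data directory
FREE_ALERT_RATIO = 0.25           # alert once less than 25% of the disk is free

def check_storage(path: str = DATA_DIR) -> None:
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < FREE_ALERT_RATIO:
        # In practice this would page the on-call engineer and link to the runbook.
        print(f"ALERT: only {free_ratio:.0%} of storage free on {path}")

if __name__ == "__main__":
    check_storage()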