Partial Outage across multiple systems

Major incident Web Application EU data center
2025-12-15 10:34 CET · 2 hours, 55 minutes, 1 second

Post-mortem

Incident Summary

On 15 Dec 2025, starting at approximately 8:20 CET, we observed that our Device Manager API could no longer handle requests. We traced this back to our messaging system, which had run out of storage space. As a result, dependent services could not push new messages to the system and became blocked, which in turn impacted other services, including the web application, the mobile app, asynchronous jobs, payments, and webhooks. The incident was resolved at approximately 12:20 CET, after we reclaimed the storage in our messaging system.

Impact on Users

This incident caused severe service degradation for all users during recurring windows of 5 to 15 minutes. During these windows, users were intermittently unable to access the web application. Payment services via self-service, the mobile app, and the Open API were also impacted, and devices did not work as expected. We do not expect any data loss.

Our Response

  • 15 Dec 2025 10:55:
Multiple services are experiencing repeated restart issues and are partially unavailable.

  • 15 Dec 2025 11:16:
We suspect our messaging system is the root cause of these issues and are currently restarting it.

  • 15 Dec 2025 11:40:
The messaging system has been restarted and messages are being processed again.

  • 15 Dec 2025 12:00:
We observed that the messaging system had insufficient storage. We contacted the messaging system’s service provider to help us determine the cause of the storage shortage.

  • 15 Dec 2025 12:24:
To resolve the disk space issue in our messaging system, we manually deleted and re-created a queue. During this operation, the Device Manager was down for roughly 10 minutes. It is now resuming normal operation.

Resolution

The storage was reclaimed by deleting and re-creating a message queue. This allowed our messaging system to resume normal operations. Shortly after, the dependent services recovered and resumed normal operations as well.
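The reclaim step can be sketched as a small helper, assuming an AMQP-style broker such as RabbitMQ accessed through the `pika` client; the post-mortem does not name the messaging system, and the function and queue names here are illustrative:

```python
def reclaim_queue_storage(channel, queue_name: str) -> None:
    """Release the disk space held by a queue's backlog by deleting the
    queue and re-creating it empty.

    WARNING: any messages still in the queue are lost, so confirm that
    producers are paused and the backlog is expendable before running this.
    """
    # Drop the queue together with its on-disk message store.
    channel.queue_delete(queue=queue_name)
    # Re-create it empty so producers and consumers can reconnect.
    channel.queue_declare(queue=queue_name, durable=True)
```

In a runbook, this step would be preceded by pausing producers and followed by verifying that dependent services reconnect, since the deleted backlog is not recoverable.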

Lessons Learned

We will lower the alerting threshold for the messaging system’s storage space. Additionally, we will create a runbook containing the steps needed to reclaim disk space, allowing us to react much faster next time, before any services are impacted. Furthermore, dependent services will be made more resilient to the messaging system being unavailable.
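The lowered alert could take the shape of a simple free-space probe run by the monitoring system. This is a minimal sketch; the `storage_alert` name, the path, and the 25% threshold are illustrative assumptions, not our actual monitoring configuration:

```python
import shutil


def storage_alert(path: str = "/var/lib/messaging",
                  threshold_pct: float = 25.0) -> bool:
    """Return True when free disk space on `path` drops below
    `threshold_pct` percent, i.e. when the alert should fire."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    return free_pct < threshold_pct
```

Raising the threshold (alerting earlier, while more free space remains) buys time to run the reclaim runbook before producers start to block.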

December 19, 2025 · 09:38 CET
Resolved

The issue has been resolved.

December 15, 2025 · 13:29 CET
Update

We are continuing to monitor the situation.

December 15, 2025 · 12:34 CET
Monitoring

We implemented a resolution and are monitoring the situation.

December 15, 2025 · 11:49 CET
Update

We found the root cause and are working on a resolution.

December 15, 2025 · 11:21 CET
Update

We are continuing to experience issues across several systems and continue to investigate the root cause.

December 15, 2025 · 11:00 CET
Investigating

We are currently experiencing performance degradation and intermittent availability across several systems. Some features may be slow or unavailable. Our team is investigating the root cause.

December 15, 2025 · 10:34 CET
