Corrupted disk in multiple regions

Incident Report for Camunda Platform 8 SaaS

Resolved

Since the fix was applied, all services have been successfully restored and we no longer see any service interruptions.

Posted Feb 29, 2024 - 21:54 CET

Update

Our service provider has successfully applied a fix to all our workloads in all regions. Our operations are back to normal and all services have been restored. We will continue to monitor the situation.

Posted Feb 29, 2024 - 18:02 CET

Update

We are continuing to work on a fix with our cloud provider. As of this statement, no production systems have been affected. We are in touch with customers who have been impacted.

Posted Feb 29, 2024 - 14:38 CET

Update

We continued working with our cloud provider and moved all affected services to stable nodes. We are now scaling down the affected workloads so that our cloud provider can apply a fix. This will have a minimal impact on the running services.

Posted Feb 29, 2024 - 04:57 CET

Update

The problems are still occurring and we are working with our cloud provider on a new strategy to remedy the situation.

Posted Feb 28, 2024 - 15:33 CET

Update

We followed the mitigation strategy of our cloud provider and have a workaround in place to resolve errors related to the disk issues. We don't expect interruptions to our services, and will continue to monitor the situation.

Posted Feb 28, 2024 - 10:39 CET

Update

The backups for all clusters >= 8.2.4 have been completed.

Our cloud provider recommended that we mitigate the observed issues by migrating the workload to a prior GKE version. We do not see any disruptions to our services at the moment and have been working on the recommended mitigation. We are seeing fewer errors and will continue to monitor the situation.

Posted Feb 27, 2024 - 23:44 CET

Monitoring

The backups for all clusters >= 8.2.4 have been completed.

Our cloud provider recommended that we mitigate the observed issues by migrating the workload to a prior GKE version. We do not see any disruptions to our services at the moment and will continue to work on the recommended mitigation.

Posted Feb 27, 2024 - 23:25 CET

Update

We are in touch with our cloud provider and are working together on a mitigation.

Posted Feb 27, 2024 - 20:16 CET

Update

We are still experiencing the issue and will proactively back up the data of all enterprise and professional clusters that are at risk of being affected.
For versions >= 8.2.4, no downtime is expected.
For versions < 8.2.4, a short downtime for all Camunda applications will occur.

Posted Feb 27, 2024 - 19:56 CET

Investigating

We've spotted issues with disks being mounted in several regions and are currently investigating.

Posted Feb 27, 2024 - 18:13 CET

This incident affected: Operate, Optimize, Tasklist, and Zeebe.