Google Cloud Platform Stacks (GCP) - Operational
Amazon Web Services Stacks (AWS) - Operational
Microsoft Azure Stacks - Operational
Civo Stacks - Operational
Control Panel (manage.corefinity.com) - Operational
Deployment Pipelines - Operational
Notice history
Jul 2023
No notices reported this month
Jun 2023
- Resolved
This incident has been resolved and the maintenance window is now complete.
- Monitoring
We have found the underlying issue and mitigation efforts are actively in progress.
GCP maintenance is currently ongoing. This is expected: for sole-tenant environments, maintenance within the GCP platform occurs on average every 4 to 6 weeks. You can read more about this at: https://cloud.google.com/compute/docs/instances/host-maintenance-overview#maintenanceevents
However, there is a live migration feature that prevents disruption to servers while maintenance is ongoing: servers live migrate to another host during the maintenance with no impact on service. You can read about this feature and its limitations here: https://cloud.google.com/compute/docs/instances/live-migration-process
The affected servers do not fall under any of those limitations, so we have been at a loss as to why they are restarting rather than live migrating as expected.
A few minutes ago we found the root cause: these node pools have a feature called "Compact Placement" enabled, which is intended to reduce latency and improve performance. This feature ensures that all nodes in each of our customers' clusters are close to each other within the data centre, giving the lowest possible latency when nodes talk to each other.
Corefinity enabled this feature for clients last year, given its potential benefits and, according to the documentation at the time, its lack of disadvantages.
However, we are now being told that with this feature enabled, servers do not live migrate during routine Google Cloud maintenance but simply restart. This limitation is missing from the documentation on the limitations of live migration, and it caused the few minutes of dropouts that our clients in this zone experienced today and earlier in May.
This note is now in the compact placement documentation (https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement), and Google have promised to add it shortly to the documentation page on the limitations of live migration.
Corefinity will be performing emergency maintenance on all of its affected infrastructure within the London GCP region (europe-west2) in order to permanently disable the compact placement feature and ensure this issue does not recur.
We will send out a notification regarding the maintenance window shortly. We expect this work to be done in the early hours of the morning, with only a few minutes of disruption to service.
- Investigating
Corefinity has been experiencing elevated error rates for customers hosted on Google Cloud Platform, specifically in the europe-west2-c zone.
Customers hosted on a multi-zone cluster are not affected.
We are currently in direct contact with GCP support and will update this status ASAP.
May 2023
- Resolved
This incident has been resolved and GCP maintenance has been completed.
- Identified
Google Compute Engine maintenance is currently in progress in London.
We are observing multiple nodes at a time from different clusters being restarted with the following log:
…pe-west2-xx-xxx-xxxx-xxxxx-xxxxxxx-xxxxx system@google.com Instance terminated during Compute Engine maintenance.
We have an escalated support case with Google to determine the length of the maintenance.
Depending on the size of your cluster, most restarts will not cause a dropout thanks to the redundancy in place. However, if multiple nodes in your cluster restart at the same time, this could cause a dropout for a few minutes.
We apologise for the inconvenience caused. We did not receive any notice of this maintenance and will follow up.
- Investigating
We are currently investigating higher than normal error rates on stacks in the following GCP zones: europe-west2-c and europe-west2-b.