Maintenance in europe-west2-c

Resolved
Degraded performance
Started 11 months ago Lasted about 14 hours

Affected

Google Cloud Platform Stacks (GCP)
Updates
  • Resolved

    This incident has been resolved and the maintenance window is now complete.

  • Monitoring

    We have found the underlying issue, and mitigation is in progress.

    GCP is currently carrying out maintenance. This is expected: for sole-tenant environments, GCP performs host maintenance every 4 to 6 weeks on average. You can read more about this at: https://cloud.google.com/compute/docs/instances/host-maintenance-overview#maintenanceevents

    However, there is a live migration feature that prevents disruption to servers while maintenance is ongoing: servers live migrate to another host during the maintenance, with no impact on service. You can read about this feature and its limitations here: https://cloud.google.com/compute/docs/instances/live-migration-process

    The affected servers do not fall under any of those limitations, so we were at a loss as to why they were restarting rather than live migrating as expected.

    A few minutes ago we found the root cause: these node pools have a feature called "Compact Placement" enabled, which was turned on to reduce latency and improve performance. It ensures that all nodes in each of our customers' clusters are close to each other within the data centre, giving the lowest possible latency when nodes talk to each other.

    Corefinity enabled this feature for clients last year because of its potential benefits and because, at the time, the documentation listed no disadvantages.

    However, we are now being told that with this feature enabled, servers do not live migrate during routine Google Cloud maintenance but simply restart. This limitation is clearly missing from the documentation on the limitations of live migration, and it caused the few minutes of dropouts that our clients in this zone experienced today and earlier in May.

    This note has now been added to the compact placement documentation (https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement), and Google has promised to add it to the documentation page on the limitations of live migration shortly.

    Corefinity will be performing emergency maintenance on all of its affected infrastructure within the London GCP region (europe-west2) in order to permanently disable the compact placement feature and ensure this issue does not recur.

    We will send out a notification about the maintenance window shortly. We expect the work to be carried out in the early hours of the morning, with only a few minutes of disruption to service.
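
    For readers who want to check their own clusters, the following is a minimal sketch (not Corefinity's actual tooling) of how node pools with compact placement might be picked out of a JSON listing. It assumes the listing has the shape of the GKE NodePool resource, where a compact pool carries a `placementPolicy` object whose `type` is `COMPACT`; the pool names and sample data are hypothetical.

```python
import json


def pools_with_compact_placement(node_pools):
    """Return the names of node pools whose placement policy type is COMPACT.

    Assumption: each entry follows the GKE NodePool JSON shape, with an
    optional "placementPolicy" object containing a "type" field.
    """
    flagged = []
    for pool in node_pools:
        # Pools without a placement policy simply have no such key.
        policy = pool.get("placementPolicy") or {}
        if policy.get("type") == "COMPACT":
            flagged.append(pool.get("name", "<unnamed>"))
    return flagged


# Hypothetical sample data for illustration only.
sample = json.loads("""
[
  {"name": "web-pool", "placementPolicy": {"type": "COMPACT"}},
  {"name": "batch-pool"}
]
""")

print(pools_with_compact_placement(sample))  # prints ['web-pool']
```

    Pools flagged this way are the ones that, per the explanation above, would restart rather than live migrate during host maintenance.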

  • Investigating

    Corefinity is experiencing elevated error rates for customers hosted on Google Cloud Platform, specifically in the europe-west2-c zone.

    Customers hosted on a multi-zone cluster are not affected.

    We are currently in direct contact with GCP support and will update this status ASAP.