Investigating reports of degraded performance

Resolved
Degraded performance
Started over 1 year ago Lasted 15 days

Affected

Google Cloud Platform Stacks (GCP)
Amazon Web Services Stacks (AWS)
Updates
  • Resolved
    Resolved

    Corefinity has completed the rollout of the fix and all stacks have now returned to normal. All scheduled tasks/indexing issues should now be resolved and we are seeing significantly better performance across the board than before the incident.

    Corefinity will be releasing a full incident report within 72 hours.

  • Monitoring
    Update

    Unfortunately we are currently dealing with a join incident on the GCP Stacks, whereby a critical security update released overnight is being pushed to all Kubernetes infrastructure by GCP.

    Corefinity offers a completely redundant and scaleable stack and in normal times these updates are pushed silently with no impact to users at all, unfortunately due the incident currently ongoing with NFS degradation, we are seeing small dropouts at the moment exclusively on GCP stacks.

    We are working closely with GCP Cloud Toyko and GCP Cloud UK in order to pause the rollout of security updates until this incident is over.

    Corefinity engineering team is now also expediting our rollout of the NFS degradation fix and we estimate the rollout to be complete within 24 hours to all stacks.

  • Monitoring
    Update

    The rollout of the patch to all UAT/Staging environments is now complete and preliminary monitoring results show not only a complete resolution to the performance degradation seen due to these events but much better performance than before.

    We will continue to monitor for another 48 hours and have a full rollout plan to all affected stacks with no expected impact (i.e we can rollout with the redundancy in place with no downtime).

  • Monitoring
    Update

    We have now identified the root cause of the issue to be with a very specific bug within the 4.0 NFS client that is being used on all affected stacks.

    This has allowed us to start to work on a mitigation plan and we expect to be releasing a patch to all Staging and UAT environments within 24 hours of this update.

    We have scaled up all affected stacks at our cost to ensure none to minimal impact to front-end of each website and as such, we like to monitor Staging/UAT environments with the fix for a period of 72 hours and if no further issues are seen, we will rollout the fix to all affected websites. You will see better performance than before the incident and as promised we will be releasing a full incident report following these events. We'd like to thank you for your continued support and patience during this time and we feel extremely lucky and honoured to have such supportive customers and partners during these unprecedented events.

  • Monitoring
    Update

    Corefinity has deployed a number of optimisations over the past few days that has seen the impact of the degradation of NFS performance reduce to minimal.

    The impact to frontend of websites remains extremely low (if any in most circumstances) and we continue to respond to all alerts within seconds ensuring stacks are scaled up at our cost during this time; and as such we continue to explore all open options thoroughly to ensure once a permanent fix is deployed, that it is deployed with no further impact to clients. We'd like to thank all of our customers and partners who have been extremely understanding of the events over the past week, and we will continue to provide updates to you as progress is made. We will be releasing an incident report with all details (technical and none technical) once everything has stabilised.

  • Monitoring
    Update

    Corefinity is still working towards a permanent fix for degraded NFS performance. We have scaled up all affected stacks at our cost to ensure none to minimal impact to front-end of each website. We do still believe this incident to be open and are working around the clock to return performance to base or better.

  • Monitoring
    Update

    Corefinity has deployed its first fix for the degraded NFS performance within its GCP and AWS stacks. We are continuing to monitor the health of all affected websites and will be providing another update and an incident report once we are certain performance has returned to base (or better).

    We'd like to thank you for your patience during this time as this certainly is not the norm for us but rest assured we are working around the clock to return service to normal.

    Impact to frontend of websites is none to minimal and we will continue to monitor all websites and any frontend alerts will be dealt with as a matter of urgency as always.

  • Monitoring
    Update

    Corefinity continues to monitor all affected stacks. We are making progress in identifying the root cause of the issue and we will continue to provide updates meanwhile. Impact to the frontend of each website is none to minimal. We will continue to monitor all websites and any frontend alerts will be dealt with as a matter or urgency as always.

  • Monitoring
    Monitoring

    Corefinity has scaled up all stacks in order to deal as best as we can with the degraded performance identified. We will continue to investigate the root cause of the issues and once a permanent fix is in place, we will be releasing an incident report.

    During this time you may notice degraded performance on your scheduled tasks and we will do our best to return to normal as fast as we can.

  • Identified
    Identified

    We have identified the cause as reduced performance across all NFS servers that were updated as part of the planned maintenance. We are currently working on mitigation options.

  • Investigating
    Investigating

    Corefinity is currently investigating reports of degraded performance on its scaleable stacks hosted on GCP and AWS.

    We will provide another update within 2 hours.