Incident post-mortem analysis: Major connectivity loss on January 27, 2025

A detailed analysis of the incident that led to service outages. The incident impacted the Compute API, Virtual Private Cloud (VPC), and Managed Kubernetes (mk8s).

Incident overview

On January 27, 2025, a routine release of the Compute and VPC control plane cluster in the eu-north1 region triggered a cascading failure across core infrastructure services within that region, leading to:

  • Complete failure of compute API operations (17:40–21:49 UTC)
  • Loss of external connectivity for user virtual machines (18:00–21:15 UTC)
  • Degradation of managed Kubernetes (mk8s) control planes and related services

The eu-west1 region was not affected by this event.

The incident was caused by a sudden spike in API requests to the Compute and VPC control plane APIs following the release, combined with misconfigured controlled service degradation mechanisms. Together, these led to uncontrolled resource consumption on control plane nodes, internal API unavailability, and the resulting service outages.

Impact

  • Compute API: All operations failed, including VM provisioning, scaling, and lifecycle management.

  • VPC: External connectivity for user VMs was lost; internal VPC connectivity remained operational in the vast majority of cases, with only isolated instances of degradation.

  • Managed Kubernetes (mk8s): User clusters became unhealthy, and external connectivity for mk8s nodes was lost.

Timeline

Incident onset and initial assessment (17:36–18:06)

Time (UTC) | Event
17:36 | Routine release of an internal networking component initiated in the Compute and VPC control plane cluster.
17:40 | First signs of service degradation appear; Compute API failures begin as internal APIs become partially unreachable.
17:46 | Monitoring systems trigger alerts indicating control plane instability.
17:54 | Control plane nodes become unreachable as abnormal resource consumption by control plane API components exhausts system resources on these nodes.
17:58 | The incident response team is assembled and full-scale incident response procedures are initiated.
18:00 | External connectivity for user VMs is lost.
18:06 | Root cause identified: a massive spike in API requests originating from internal networking components is overwhelming control plane APIs.

Focused recovery efforts (18:06–19:40)

During this period, the incident response team:

  • Isolated control plane nodes from excessive load to regain control over them.
  • Restored administrative access to the affected nodes and stabilized core components.
  • Tested and refined rate-limiting mechanisms to control API request spikes (a sketch of this kind of mechanism follows this list).
  • In several controlled iterations, gradually reintroduced load to the control plane API, ensuring stability at each step before proceeding further.
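
To illustrate the kind of rate limiting involved in this step, the sketch below shows a hypothetical Go HTTP middleware built on the token-bucket limiter from golang.org/x/time/rate. It is not the mechanism actually deployed during the incident; the handler, endpoint, and the 50 req/s and burst-of-100 values are assumptions chosen for illustration. The principle is the same: requests above the configured rate are rejected immediately, so load on the control plane API can be reintroduced in controlled increments.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// rateLimited wraps an API handler with a token-bucket limiter. Requests
// that exceed the configured rate are rejected with 429 instead of being
// queued, so excess load is shed rather than accumulating on the node.
func rateLimited(next http.Handler, rps float64, burst int) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(rps), burst)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded, retry later", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for an internal control plane API endpoint.
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	// Illustrative values only: 50 requests/s sustained, bursts of up to 100.
	log.Fatal(http.ListenAndServe(":8080", rateLimited(api, 50, 100)))
}
```

During a staged recovery, the sustained rate can be raised step by step while resource consumption on the control plane nodes is watched, mirroring the controlled reintroduction of load described above.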

Stabilization phase (19:40–23:12)

Time (UTC) | Event
19:40 | Notable reduction in API load observed. Control plane API stability improves after scaling down a number of internal components and disabling non-essential services.
20:08 | The team confirms that the control plane API is fully operational and that the applied rate-limiting mechanisms keep it stable under the current load.
21:05 | mk8s clusters begin to recover as control plane stability improves.
21:15 | External VM connectivity is fully restored.
21:49 | Compute API, mk8s clusters, and all core services are fully operational.
22:21 | Resource limits are applied to control plane API processes to prevent recurrence.
23:12 | All temporary mitigations are reviewed; most are reverted to baseline configurations after system stability is verified.

Root cause

Control plane release trigger

A routine release of an internal component in the Compute and VPC control plane cluster led to a significant spike in requests to internal control plane APIs. This sudden surge overwhelmed the system, causing a rapid increase in memory usage.

Misconfiguration of controlled service degradation subsystem

The mechanisms intended to control service degradation under load were misconfigured. These configurations proved inadequate for the scale of the cluster and the load generated by the release. As a result, resource consumption grew unchecked, degrading the API server’s ability to handle requests effectively.
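
To make the intent of such a mechanism concrete, here is a minimal, hypothetical Go sketch of one common controlled-degradation pattern: capping the number of in-flight requests so that memory use stays bounded and excess requests fail fast. The cap value and names are illustrative assumptions, not the configuration that was in place during the incident.

```go
package main

import (
	"log"
	"net/http"
)

// maxInFlight bounds concurrent requests so that per-request memory on the
// node stays roughly proportional to a known ceiling. The value is purely
// illustrative; an adequate limit depends on cluster scale and request cost.
const maxInFlight = 500

// shedLoad rejects requests beyond maxInFlight with 503 instead of letting
// them queue up and consume memory without bound, which is the kind of
// unchecked growth a degradation subsystem is meant to prevent.
func shedLoad(next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "server overloaded, retry later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	// Stand-in for a control plane API endpoint.
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	log.Fatal(http.ListenAndServe(":8081", shedLoad(api)))
}
```

A cap like this only degrades service gracefully if it is sized for the cluster and the expected request cost; in this incident the configured thresholds were effectively too permissive for the load generated by the release.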

Cascading failures across control plane components

The unavailability of control plane APIs caused failures in dependent components, including the VPC's routing services. As control plane APIs degraded, node states within the infrastructure became inconsistent, which in turn led to incorrect node states in the VPC control plane. As a consequence, BGP routes were withdrawn, and default routes to external networks were lost. Despite this, VPC interconnect between client VMs remained functional for the majority of VPC networks, with only rare cases of connectivity degradation.

Additionally, the unavailability of control plane APIs directly impacted Compute API operations, since the Compute API relies on them for managing virtual machine lifecycle activities. This dependency caused a complete failure of Compute API functionality during the incident.

mk8s control plane degradation

The degradation of internal connectivity affected mk8s etcd clusters, rendering mk8s control planes unhealthy. DNS resolution issues further complicated recovery efforts, as mk8s nodes could not reconnect to the control plane without external DNS access.
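
As a purely illustrative idea of how this DNS dependency can be softened, the hypothetical Go sketch below wraps host lookups with a last-known-good cache: while external DNS is unavailable, a node can still reach endpoints it resolved before the outage. This is not how mk8s resolves names today; the corresponding review is listed in the action items below, and the names and behavior here are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
)

// fallbackResolver looks up hosts via the default resolver but remembers the
// last successful answer, so a temporary loss of external DNS does not stop
// a node from reaching endpoints that were resolvable before the outage.
type fallbackResolver struct {
	mu    sync.Mutex
	cache map[string][]string
}

func newFallbackResolver() *fallbackResolver {
	return &fallbackResolver{cache: make(map[string][]string)}
}

// lookupHost returns fresh results while DNS is healthy and falls back to
// the last known-good answer when the live lookup fails.
func (r *fallbackResolver) lookupHost(ctx context.Context, host string) ([]string, error) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err == nil {
		r.mu.Lock()
		r.cache[host] = addrs
		r.mu.Unlock()
		return addrs, nil
	}
	r.mu.Lock()
	cached, ok := r.cache[host]
	r.mu.Unlock()
	if ok {
		return cached, nil // serve stale data rather than failing outright
	}
	return nil, err
}

func main() {
	r := newFallbackResolver()
	addrs, err := r.lookupHost(context.Background(), "example.com")
	fmt.Println(addrs, err)
}
```

Serving stale records is a trade-off: it keeps reconnection possible during a resolver outage at the cost of potentially outdated addresses.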

Post-incident action plan

Critical action items

Control plane stability enhancements

  • Improve control of system resource consumption by control plane API components to prevent uncontrolled resource usage.
  • Introduce rate limiting and flow control policies to mitigate overload scenarios.
  • Improve mechanisms for out-of-band management of control plane nodes.

Service degradation framework

  • Redesign controlled degradation subsystems to ensure graceful failure modes during incidents.
  • Enable dynamic configuration of control plane API resource limits without requiring full restarts (a minimal sketch of this approach follows).
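
As a purely illustrative reading of the last item, the hypothetical Go sketch below keeps a limit in an atomic value and re-reads it from a configuration file on SIGHUP, so operators can tighten or relax the limit during an incident without restarting the API server. The file path, signal choice, and limit semantics are assumptions, not a description of our actual tooling.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"strconv"
	"strings"
	"sync/atomic"
	"syscall"
)

// currentLimit holds the active limit. Request handlers would call
// currentLimit.Load() on every admission decision, so an update takes
// effect immediately without a process restart.
var currentLimit atomic.Int64

// loadLimit re-reads the limit from a config file (the path is illustrative).
func loadLimit(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Printf("keeping previous limit: %v", err)
		return
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		log.Printf("keeping previous limit: %v", err)
		return
	}
	currentLimit.Store(v)
	log.Printf("limit updated to %d", v)
}

func main() {
	const configPath = "/etc/controlplane/request-limit" // hypothetical path
	loadLimit(configPath)

	// Reload the limit whenever the process receives SIGHUP.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		loadLimit(configPath)
	}
}
```

Watching a file or a configuration service instead of a signal would work equally well; the point is that the limit is consulted at request time rather than baked in at startup.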

Network resilience improvements

  • Reduce the VPC data path's dependency on control plane availability by decoupling critical control plane communication from dynamically managed network interfaces.
  • Deploy additional BGP route reflectors to improve routing redundancy.

Normal priority action items

Managed Kubernetes reliability improvements

  • Review DNS dependency architecture to improve cluster resilience against external DNS outages.
  • Enhance monitoring for mk8s etcd health and node connectivity status.

VPC enhancements

  • Improve BGP session monitoring and failover handling.
  • Implement fallback routing mechanisms for critical VPC paths.

Operational process adjustments

  • Automate common recovery procedures to reduce manual intervention during incidents.
  • Improve incident response documentation and runbooks for control plane outages.

We acknowledge the severity of this incident and are committed to addressing the root causes to improve system reliability and resilience.
