

Developing highly available workload that can tolerate a zone outage is no trivial task. In this blog, we will explore various recommendations to get closer to that goal. While many recommendations are general enough, the examples are specific in how to achieve this in a Gardener-managed cluster and where/how to tweak the different control plane components. If you do not use Gardener, it may still be a worthwhile read, as most settings can be influenced with most of the Kubernetes providers.

First, however, what is a zone outage? It sounds like a clear-cut “thing”, but it isn’t. There are many things that can go haywire:

- Elevated cloud provider API error rates for individual or multiple services.
- Network bandwidth reduced or latency increased, usually also affecting storage subsystems as they are network-attached.
- No networking at all, no DNS, machines shutting down or restarting, …
- Functional issues, of either the entire service (e.g. all block device operations) or only parts of it.
- All services down, temporarily or permanently (the proverbial burning down data center 🔥).

This and everything in between make it hard to prepare for such events, but you can still do a lot. The most important recommendation is to not target specific issues exclusively - tomorrow another service will fail in an unanticipated way. Also, focus more on meaningful availability than on internal signals (useful, but not as relevant as the former). Always prefer automation over manual intervention (e.g. leader election is a pretty robust mechanism, auto-scaling may be required as well, etc.).

Also remember that HA is costly - you need to balance it against the cost of an outage, as silly as this may sound, e.g. running all this excess capacity “just in case” vs. a risk-based approach in between where you have means that will kick in, but they are not guaranteed to work (e.g. if the cloud provider is out of resource capacity). Maybe some of your components must run at the highest possible availability level, but others not - that’s a decision only you can make.

The Kubernetes cluster control plane is managed by Gardener (as pods in separate infrastructure clusters to which you have no direct access) and can be set up with no failure tolerance (control plane pods will be recreated best-effort when resources are available) or one of the failure tolerance types node or zone.

Strictly speaking, static workload does not depend on the (high) availability of the control plane, but static workload doesn’t rhyme with Cloud and Kubernetes, and it also means that when you possibly need it the most, e.g. during a zone outage, critical self-healing or auto-scaling functionality won’t be available to you and your workload if your control plane is down as well.
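As a rough illustration of the control plane failure tolerance types (node or zone) that Gardener offers, the fragment below sketches how one might be requested in a Shoot manifest. It is a minimal sketch, not a complete manifest: the shoot name and project namespace are placeholders invented for this example, and the rest of the Shoot spec (provider, region, workers, etc.) is left out.

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot                 # placeholder name (assumption, not from this blog)
  namespace: garden-my-project   # placeholder project namespace (assumption)
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone   # or "node"; leave out controlPlane.highAvailability for no failure tolerance
  # remaining Shoot spec (provider, region, workers, ...) omitted in this sketch
```

Which type is appropriate depends on the cost/risk trade-off for your workload; check the Gardener documentation for your landscape before relying on this fragment.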
