High Availability
For production workloads, Linkerd’s control plane can run in high availability (HA) mode. This mode:
- Runs three replicas of critical control plane components.
- Sets production-ready CPU and memory resource requests on control plane components.
- Sets production-ready CPU and memory resource requests on data plane proxies
- Requires that the proxy auto-injector be functional for any pods to be scheduled.
- Sets anti-affinity policies on critical control plane components to ensure, if possible, that they are scheduled on separate nodes and in separate zones by default.
Enabling HA
You can enable HA mode at control plane installation time with the --ha
flag:
linkerd install --ha | kubectl apply -f -
You can override certain aspects of the HA behavior at installation time by
passing other flags to the install
command. For example, you can override the
number of replicas for critical components with the --controller-replicas
flag:
linkerd install --ha --controller-replicas=2 | kubectl apply -f -
See the full install
CLI documentation for
reference.
The linkerd upgrade
command can be used to enable HA mode on an existing
control plane:
linkerd upgrade --ha | kubectl apply -f -
Proxy injector failure policy
The HA proxy injector is deployed with a stricter failure policy to enforce automatic proxy injection. This setup ensures that no annotated workloads are accidentally scheduled to run on your cluster, without the Linkerd proxy. (This can happen when the proxy injector is down.)
If proxy injection process failed due to unrecognized or timeout errors during the admission phase, the workload admission will be rejected by the Kubernetes API server, and the deployment will fail.
Hence, it is very important that there is always at least one healthy replica of the proxy injector running on your cluster.
If you cannot guarantee the number of healthy proxy injector on your cluster,
you can loosen the webhook failure policy by setting its value to Ignore
, as
seen in the
Linkerd Helm chart.
Exclude the kube-system namespace
Per recommendation from the Kubernetes
documentation,
the proxy injector should be disabled for the kube-system
namespace.
This can be done by labeling the kube-system
namespace with the following
label:
kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled
The Kubernetes API server will not call the proxy injector during the admission phase of workloads in namespace with this label.
If your Kubernetes cluster have built-in reconcilers that would revert any changes
made to the kube-system
namespace, you should loosen the proxy injector
failure policy following these instructions.
Pod anti-affinity rules
All critical control plane components are deployed with pod anti-affinity rules to ensure redundancy.
Linkerd uses a requiredDuringSchedulingIgnoredDuringExecution
pod
anti-affinity rule to ensure that the Kubernetes scheduler does not colocate
replicas of critical component on the same node. A
preferredDuringSchedulingIgnoredDuringExecution
pod anti-affinity rule is also
added to try to schedule replicas in different zones, where possible.
In order to satisfy these anti-affinity rules, HA mode assumes that there are always at least three nodes in the Kubernetes cluster. If this assumption is violated (e.g. the cluster is scaled down to two or fewer nodes), then the system may be left in a non-functional state.
Note that these anti-affinity rules don’t apply to add-on components like Prometheus and Grafana.
Scaling Prometheus
For production workloads, we recommend setting up your own Prometheus instance to scrape the data plane metrics, following the instructions here. This will provide you with more control over resource requirement, backup strategy and data retention.
When planning for memory capacity to store Linkerd timeseries data, the usual guidance is 5MB per meshed pod.
If your Prometheus is experiencing regular OOMKilled
events due to the amount
of data coming from the data plane, the two key parameters that can be adjusted
are:
storage.tsdb.retention.time
defines how long to retain samples in storage. A higher value implies that more memory is required to keep the data around for a longer period of time. Lowering this value will reduce the number ofOOMKilled
events as data is retained for a shorter period of timestorage.tsdb.retention.size
defines the maximum number of bytes that can be stored for blocks. A lower value will also help to reduce the number ofOOMKilled
events
For more information and other supported storage options, see the Prometheus documentation here.
Working with Cluster AutoScaler
The Linkerd proxy stores its mTLS private key in a tmpfs emptyDir volume to ensure that this information never leaves the pod. This causes the default setup of Cluster AutoScaler to not be able to scale down nodes with injected workload replicas.
The workaround is to annotate the injected workload with the
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
annotation. If you
have full control over the Cluster AutoScaler configuration, you can start the
Cluster AutoScaler with the --skip-nodes-with-local-storage=false
option.
For more information on this, see the Cluster AutoScaler documentation here.