Recovering a Kubeadm Cluster

March 23, 2018
ops kubernetes

Recently, a small cluster I maintain became unresponsive after its kube-apiserver failed. Because kubeadm clusters are currently limited to a single master, any interaction with the cluster was impossible.

With the API unreachable, troubleshooting and recovery required digging a bit deeper. By SSHing into the physical node, I was able to determine that the kube-apiserver was attempting to start, and then being killed because it was not responding to its livenessProbe in time.

By looking at the logs on the host (they are under /var/log/containers), I could see that the kube-apiserver was failing because it was unable to contact etcd. On closer inspection, it turned out that etcd was (for a reason still unknown) not running.
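As a rough sketch of how I go about finding the right log file: the helper below prints the most recently modified file in a directory whose name contains a given component name. The function name latest_log is my own invention, and matching on the file name assumes the log files are named after the pod and container they belong to.

```shell
# Print the most recently modified entry in a directory whose name
# contains the given component name (prints nothing if nothing matches).
latest_log() {
    log_dir=$1
    component=$2
    # -t sorts newest first; -d lists matching entries themselves
    ls -td "$log_dir"/*"$component"* 2>/dev/null | head -n 1
}

# Example (run on the master node):
#   tail -n 50 "$(latest_log /var/log/containers kube-apiserver)"
#   tail -n 50 "$(latest_log /var/log/containers etcd)"
```

If nothing matches, the helper prints nothing, so it is worth checking the result before passing it to tail.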

This is still strange, because my understanding is that the kubelet is supposed to monitor the components of the control plane (the pods defined by the manifests in the --pod-manifest-path directory) and keep them running.
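One way to check which directory the kubelet is actually watching is to read the flag off its command line. This is only a sketch: the helper name manifest_path_from_args is made up, and it assumes the flag is passed in the --pod-manifest-path=/some/dir form (with an equals sign) rather than through a config file.

```shell
# Extract the value of --pod-manifest-path from a kubelet command line.
# Assumes the flag is written as --pod-manifest-path=/some/dir.
manifest_path_from_args() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^--pod-manifest-path=//p'
}

# Example (run on the master node; assumes the kubelet runs under systemd):
#   manifest_path_from_args "$(ps -o args= -C kubelet)"
#   systemctl status kubelet
#   journalctl -u kubelet --since "1 hour ago"
```

If the helper prints nothing, either the kubelet is not running or the path is configured some other way, and the kubelet's own logs are the next place to look.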

At this point, if we are able to get etcd back up, the API server should become responsive again, and we should be good. Unfortunately, with the API server down, we can’t apply the kubeadm etcd manifest.

As long as our etcd storage is intact, we can get creative and recover to our previous state. If the etcd data is not intact, clearing out the etcd data directory will still allow us to continue and recover; however, any Kubernetes state will have been lost.
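Before touching the data directory at all, it seems prudent to take a copy of it. This was not part of my original recovery, so treat it as an optional precaution. The helper name backup_dir and the .bak-TIMESTAMP naming are my own choices.

```shell
# Copy a directory to a timestamped sibling and print the backup's path.
# cp -a preserves ownership, permissions, and timestamps.
backup_dir() {
    src=${1%/}                               # drop a trailing slash, if any
    dest="$src.bak-$(date +%Y%m%d%H%M%S)"
    cp -a "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (as root, while etcd is not running so the files are not changing):
#   backup_dir /var/lib/etcd
```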

The steps (assuming an intact etcd data directory) are as follows:

Start etcd manually, with options taken from the etcd manifest. It can run either natively on the host or as a Docker container.

# Manually start etcd
docker run -P \
	-v /var/lib/etcd:/var/lib/etcd \
	--network host \
	gcr.io/google_containers/etcd-amd64:3.0.17 \
		etcd \
			--listen-client-urls=http://127.0.0.1:2379 \
			--advertise-client-urls=http://127.0.0.1:2379 \
			--data-dir=/var/lib/etcd

Wait until the kube-apiserver comes up. You can either watch the kube-system namespace with kubectl, or watch the processes running on the host.
Apply the etcd manifest:

kubectl apply -f /etc/kubernetes/manifests/etcd.yaml

At this point, the new etcd container will not start (due to a port conflict with the manually started etcd); however, it will have been created.

Kill the manually started etcd container. This will likely take the kube-apiserver down again for a brief moment.
Wait for the Kubernetes-managed etcd pod to restart, which will allow the kube-apiserver to come back up.
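Several of the steps above boil down to "wait until something responds". A small retry helper can take the guesswork out of that. This is a sketch rather than what I actually ran: wait_until is a made-up name, and the example URLs assume etcd is listening on 127.0.0.1:2379 (as in the docker command above) and that kubectl is already configured on the node.

```shell
# Retry a command until it succeeds, up to a maximum number of attempts.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Run the command quietly; stop as soon as it exits successfully
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Examples (run on the master node):
#   wait_until 30 2 curl -fsS http://127.0.0.1:2379/health   # etcd answers
#   wait_until 60 5 kubectl get --raw /healthz               # API server answers
#   kubectl get pods -n kube-system                          # final check
```

The helper returns non-zero if the command never succeeds, so it can be chained with && in front of the next step.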

At this point, you should have a functional, self-hosted control plane again. Congrats!

Hopefully this works out for you, or was at least helpful. If you have any questions, don’t hesitate to shoot me an email, or follow me on Twitter @nrmitchi.
