Correcting Kops Etcd

August 17, 2017
ops deployment kubernetes k8s kops

Recently, I attempted to downsize a staging cluster from a three-master HA setup to a single master. The HA setup was unnecessary, and the change was primarily a cost-saving measure. Unfortunately, this ‘quick cost cut’ ended up taking multiple hours to clean up.

I looked at the initial situation rather naively: kops sets up a multi-master deployment by creating an AutoScaling Group (ASG) for each master (assuming, of course, that your masters are in separate Availability Zones). My quick solution to lower the number of masters was to simply delete two of these ASGs. This worked in the sense that we now only had one master. It didn’t work in the sense that we had broken etcd’s quorum, and thus the functionality of the cluster.
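To see why deleting two of three masters breaks things, here is a minimal sketch of the majority rule that etcd's consensus relies on. The function names are illustrative only and are not part of etcd or kops.

```python
def quorum_size(configured_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return configured_members // 2 + 1


def has_quorum(configured_members: int, healthy_members: int) -> bool:
    """True if enough members are reachable to form a majority."""
    return healthy_members >= quorum_size(configured_members)


# The etcd cluster was created with 3 members. Deleting two ASGs left
# 1 reachable member, while the membership list still expected 3.
print(quorum_size(3))    # 2
print(has_quorum(3, 1))  # False
print(has_quorum(3, 2))  # True
```

Removing a single master would have left two of three members, which is still a majority. Removing two at once is what left the surviving member unable to make progress.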

At this point there were two options: (1) replace the entire cluster, or (2) reset etcd. For a variety of reasons, creating a new cluster was not an optimal solution, so I went with resetting everything etcd-related on the remaining master.

Note that this will clear all Kubernetes resources from your cluster, and you will have to repopulate afterwards.

In order to do this, you’ll need to:

SSH into your remaining master.
Find the host volume paths for etcd and etcd-events. These are in their respective manifest files under /etc/kubernetes/manifests.
Go into these directories and clear them.
Restart your master.
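The "find the paths, then clear them" steps above can be sketched in Python. This is only an illustration under several assumptions: the manifest has already been parsed into a dictionary (for example with a YAML parser), the example manifest, its volume name, and its path are hypothetical placeholders, and `host_paths` and `clear_directory` are names made up for this sketch. A real manifest may also list hostPath volumes that are not data directories (such as log files), so check each path by hand before clearing anything.

```python
import shutil
from pathlib import Path
from typing import List


def host_paths(pod_manifest: dict) -> List[str]:
    """Return every hostPath volume path declared in a parsed pod manifest."""
    volumes = pod_manifest.get("spec", {}).get("volumes", [])
    return [v["hostPath"]["path"] for v in volumes if "hostPath" in v]


def clear_directory(directory: str, dry_run: bool = True) -> List[str]:
    """Remove everything inside `directory`, keeping the directory itself.

    Returns the entries that were (or, in a dry run, would be) removed.
    """
    entries = sorted(Path(directory).iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # recursive delete for subdirectories
            else:
                entry.unlink()        # plain files and symlinks
    return [str(entry) for entry in entries]


# Hypothetical manifest content; real volume names and paths will differ.
example_manifest = {
    "spec": {
        "volumes": [
            {"name": "data", "hostPath": {"path": "/hypothetical/etcd/data"}},
        ]
    }
}

for path in host_paths(example_manifest):
    print(path)
```

The dry run is the default on purpose: calling `clear_directory(path)` only lists what would be deleted, and nothing is removed until you pass `dry_run=False`. As noted above, clearing these directories wipes all Kubernetes resources from the cluster.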

At this point the Kubernetes control plane should become operational again. There is likely a more surgical approach that could be taken here (reconfiguring the etcd peers rather than clearing all data); however, this approach seemed to be the quickest way to recover from a completely lost quorum.

Hopefully this works out for you, or was at least helpful. If you have any questions, don’t hesitate to shoot me an email, or follow me on Twitter @nrmitchi.
