Equipment/Blanton/Kubernetes

From London Hackspace Wiki

Intro

This page provides a bit of a braindump on the Kubernetes setup.

There is (currently) one master node on Blanton. I have thought about adding one on Landin, but it would require careful thought: an even number of masters is generally discouraged because it can lead to split-brain scenarios. Ideally we'd have a third machine to keep the number of masters odd.

Node               Role    Location  Notes
kube-master        Master  Blanton   4 cores, 4GB of RAM
kube-node-blanton  Worker  Blanton   8 cores, 8GB of RAM
kube-node-landin   Worker  Landin    8 cores, 8GB of RAM
kube-node          Worker  Blanton   8 cores, 8GB of RAM. Older worker, mostly providing redundant capacity when one of the others is down for maintenance.

As of writing, all of our containerised services can fit on a single node.

General Notes

I did try doing something with docker-compose, but the networking got unwieldy fast, and I realised I was about to recreate something not unlike Kubernetes, but badly, in a bunch of scripts! A big part of what took so long to get this working was the dual-stack IPv4 and IPv6 support needed to fit into the rest of the hackspace environment.

A few quick notes:

  • Networking is provided by Calico
  • LoadBalancer requests are serviced by metallb
    • If you want both IPv4 and IPv6 you will need to create two LoadBalancer instances pointing to the same service
  • nginx-ingress is configured to support HTTP/HTTPS services
  • cert-manager is configured to issue LetsEncrypt certificates automatically, assuming DNS entries are already in place (as would be needed for a regular VM wanting a cert)
    • Mark your ingress with the annotation cert-manager.io/cluster-issuer: "letsencrypt-prod"
  • there's a single-node glusterfs "cluster" providing storage
  • While it's all currently on Blanton, if there was another box (or ideally two) available, it would be possible to make this much more resilient
  • It's running a bleeding edge version of cert-manager and ingress-nginx because I upgraded Kubernetes to 1.22 before things were ready :-)
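As a sketch of the cert-manager.io/cluster-issuer annotation mentioned above, an HTTPS service's Ingress might look like this. The hostname, service name, and secret name here are placeholders, not a real hackspace service:

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  annotations:
    # Ask cert-manager to issue a LetsEncrypt cert for the TLS hosts below.
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - example.london.hackspace.org.uk
    secretName: example-tls
  rules:
  - host: example.london.hackspace.org.uk
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example
            port:
              number: 80
EOF
```

cert-manager then creates the certificate in the example-tls secret automatically, provided the DNS entry already resolves.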

MetalLB is configured to allocate IP addresses in the ranges 10.0.21.128/25 and 2a00:1d40:1843:182:f000::/68 - it uses layer 2 ARP to advertise these on the LAN.
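The "two LoadBalancer instances per service" pattern mentioned above looks roughly like this: two Services, one per address family, selecting the same pods. The names and selector are placeholders:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: example-v4
spec:
  type: LoadBalancer
  ipFamilies: ["IPv4"]   # MetalLB allocates from 10.0.21.128/25
  selector:
    app: example
  ports:
  - port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-v6
spec:
  type: LoadBalancer
  ipFamilies: ["IPv6"]   # MetalLB allocates from 2a00:1d40:1843:182:f000::/68
  selector:
    app: example
  ports:
  - port: 80
EOF
```

MetalLB hands each Service one address from the matching range and advertises it on the LAN.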

Gaining access to the cluster currently requires a certificate, which is a huge pain in the rear end, so I'm working on LDAP auth. I'm getting really close with this (it works, but isn't the nicest to use yet).

Instructions Braindump

Adding a node

Kubernetes mostly requires a basic OS install, but there are a few steps you need to make sure you do correctly.

A key point here is that until recently, Kubernetes didn’t support nodes with swap enabled. These instructions therefore do not have swap. (I’m still not entirely convinced it’s a good idea!)
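A minimal sketch of disabling swap on a Debian node, assuming the stock installer layout with a swap line in /etc/fstab:

```shell
# Turn swap off for the running system.
sudo swapoff -a
# Comment out any swap entries in fstab so it stays off after a reboot.
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab
# Verify: this should print nothing if swap is fully disabled.
swapon --show
```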

  1. Install latest Debian on a VM (without swap) with SSH and standard system utilities
  2. Stick your SSH key into /root/.ssh/authorized_keys and /home/<you>/.ssh/authorized_keys
  3. Add to the lhs-hosts section of Ansible (all nodes starting with kube-* get some basic kubernetes requirements installed) and deploy to it
  4. On the master, run the following to get a token:
    kubeadm token create
  5. On the master, run the following to get the CA cert hash:
    openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
    openssl dgst -sha256 -hex | sed 's/^.* //'
  6. On the new node, run
    kubeadm join --token <token> kube-master.lan.london.hackspace.org.uk:6443 --discovery-token-ca-cert-hash sha256:<hash>
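As a shortcut for steps 4-6, kubeadm can also print a complete join command (fresh token plus CA hash) in one go:

```shell
# On the master; prints a ready-to-paste "kubeadm join ..." line
# to run on the new node.
kubeadm token create --print-join-command
```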

Draining a node for maintenance

It's a good idea to shift work off a node when you're about to do anything to it (upgrade, reboot, etc.)

  1. run kubectl drain <node> --ignore-daemonsets

when you're done with the maintenance:

  1. run kubectl uncordon <node>

You might want to delete some pods running on the remaining nodes so they get restarted and spread more evenly across the cluster. Alternatively, you can just wait for routine updates to restart pods over time.
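One way to re-spread the workload after maintenance is a rolling restart, which recreates pods and lets the scheduler place them across all available nodes again (this assumes the workloads are Deployments):

```shell
# Restart one deployment...
kubectl rollout restart deployment <name>
# ...or every deployment in a namespace:
kubectl rollout restart deployment -n <namespace>
```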

Removing a node

  1. Drain the node
     kubectl drain <node> --ignore-daemonsets
  2. Delete the node record from Kubernetes
     kubectl delete node <node>
  3. Probably delete the VM or something - it's done now

Upgrading Kubernetes

Full instructions here: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ READ THEM!

It’s perhaps worth ignoring the 1.x.0 releases, since experience suggests things like MetalLB and Calico might not yet support them in a stable version, which is a recipe for pain.

  1. On the master node, run apt-cache madison kubeadm to find a version to update to
  2. On the master node, run:
    sudo apt-get install -y --allow-change-held-packages kubeadm=<your-chosen-version>
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v<your-chosen-version>
    sudo apt-get install -y --allow-change-held-packages kubelet=<your-chosen-version> kubectl=<your-chosen-version>
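The --allow-change-held-packages flag above implies the kube* packages are on hold; if they aren't already, it's worth pinning them after the upgrade so a routine apt upgrade can't bump Kubernetes to an untested version:

```shell
sudo apt-mark hold kubeadm kubelet kubectl
```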


On each node:

  1. Drain the node
  2. Run:
    sudo apt-get install -y --allow-change-held-packages kubeadm=<your-chosen-version>
    sudo kubeadm upgrade node
    sudo apt-get install -y --allow-change-held-packages kubelet=<your-chosen-version> kubectl=<your-chosen-version>
  3. run kubectl get nodes wherever you normally run kubectl to make sure the node is running the expected version
  4. Uncordon node

Fixing Screwups

Re-adding a node you removed by mistake

If you accidentally run

kubectl delete node <node>

when you didn't mean to, don't panic - the workload should be shifted automatically to a remaining node. Here's how to re-add the one you just removed:

  1. Run kubeadm reset on the affected node
  2. Run kubeadm join as if it was a new node
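On the affected node, that looks something like this, with the token and hash obtained from the master as described under "Adding a node":

```shell
# Tear down the node's local kubelet/kubeadm state.
sudo kubeadm reset
# Re-join the cluster as if it were a new node.
sudo kubeadm join --token <token> kube-master.lan.london.hackspace.org.uk:6443 \
  --discovery-token-ca-cert-hash sha256:<hash>
```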