Equipment/Blanton/Kubernetes

From London Hackspace Wiki

Intro

This page provides a bit of a braindump on the Kubernetes setup.

There is (currently) one master node on Blanton. I have thought about adding one on Landin, but it would require careful thought: an even number of masters is generally discouraged because it can lead to split-brain scenarios. Ideally we'd have a third machine to keep the number of masters odd.

Node               Role    Location  Notes
kube-master        Master  Blanton   4 cores, 4GB of RAM
kube-node-blanton  Worker  Blanton   8 cores, 8GB of RAM
kube-node-landin   Worker  Landin    8 cores, 8GB of RAM
kube-node          Worker  Blanton   8 cores, 8GB of RAM. Older worker, mostly providing redundant capacity when one of the others is down for maintenance.

As of writing, all of our containerised services can fit on a single node.

General Notes

I did try doing something with docker-compose, but the networking got unwieldy fast, and I realised I was about to recreate something not unlike Kubernetes, but badly, in a bunch of scripts! A big part of what took so long to get this working was the dual-stack IPv4 and IPv6 support needed to fit into the rest of the hackspace environment.

A few quick notes:

  • Networking is provided by Calico
  • LoadBalancer requests are serviced by metallb
    • If you want both IPv4 and IPv6 you will need to create two LoadBalancer instances pointing to the same service
  • nginx-ingress is configured to support HTTP/HTTPS services
  • cert-manager is configured to issue LetsEncrypt certificates automatically, assuming DNS entries are already in place (as would be needed for a regular VM wanting a cert)
    • Mark your ingress with the annotation cert-manager.io/cluster-issuer: "letsencrypt-prod"
  • there's a single-node glusterfs "cluster" providing storage
  • While it's all currently on Blanton, if there was another box (or ideally two) available, it would be possible to make this much more resilient
  • It's running a bleeding edge version of cert-manager and ingress-nginx because I upgraded Kubernetes to 1.22 before things were ready :-)
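As a sketch of the cert-manager.io/cluster-issuer annotation mentioned above, an HTTPS service's Ingress might look like this. The hostname, service name, and secret name here are placeholders, not a real hackspace service:

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  annotations:
    # Ask cert-manager to issue a LetsEncrypt cert for the TLS hosts below.
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - example.london.hackspace.org.uk
    secretName: example-tls
  rules:
  - host: example.london.hackspace.org.uk
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example
            port:
              number: 80
EOF
```

cert-manager then creates the certificate in the example-tls secret automatically, provided the DNS entry already resolves.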

MetalLB is configured to allocate IP addresses in the ranges 10.0.21.128/25 and 2a00:1d40:1843:182:f000::/68 - it uses layer 2 ARP to advertise these on the LAN.
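The "two LoadBalancer instances per service" pattern mentioned above looks roughly like this: two Services, one per address family, selecting the same pods. The names and selector are placeholders:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: example-v4
spec:
  type: LoadBalancer
  ipFamilies: ["IPv4"]   # MetalLB allocates from 10.0.21.128/25
  selector:
    app: example
  ports:
  - port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-v6
spec:
  type: LoadBalancer
  ipFamilies: ["IPv6"]   # MetalLB allocates from 2a00:1d40:1843:182:f000::/68
  selector:
    app: example
  ports:
  - port: 80
EOF
```

MetalLB hands each Service one address from the matching range and advertises it on the LAN.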

Gaining access to the cluster currently requires a certificate, which is a huge pain in the rear end, so I'm working on LDAP auth. I'm getting really close with this (it works, but isn't the nicest to use yet).

Instructions Braindump

Adding a node

Kubernetes mostly requires a basic OS install, but there are a few steps you need to make sure you do correctly.

A key point here is that until recently, Kubernetes didn’t support nodes with swap enabled. These instructions therefore do not have swap. (I’m still not entirely convinced it’s a good idea!)
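A minimal sketch of disabling swap on a Debian node, assuming the stock installer layout with a swap line in /etc/fstab:

```shell
# Turn swap off for the running system.
sudo swapoff -a
# Comment out any swap entries in fstab so it stays off after a reboot.
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab
# Verify: this should print nothing if swap is fully disabled.
swapon --show
```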

  1. Install latest Debian on a VM (without swap) with SSH and standard system utilities
  2. Stick your SSH key into /root/.ssh/authorized_keys and /home/<you>/.ssh/authorized_keys
  3. Add to the lhs-hosts section of Ansible (all nodes starting with kube-* get some basic kubernetes requirements installed) and deploy to it
  4. On the master, run the following to get a token:
    kubeadm token create
  5. On the master, run the following to get the CA cert hash:
    openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
    openssl dgst -sha256 -hex | sed 's/^.* //'
  6. On the new node, run
    kubeadm join --token <token> kube-master.lan.london.hackspace.org.uk:6443 --discovery-token-ca-cert-hash sha256:<hash>
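As a shortcut for steps 4-6, kubeadm can also print a complete join command (fresh token plus CA hash) in one go:

```shell
# On the master; prints a ready-to-paste "kubeadm join ..." line
# to run on the new node.
kubeadm token create --print-join-command
```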

Draining a node for maintenance

It's a good idea to shift work off a node when you're about to do anything to it (upgrade, reboot, etc.)

  1. run kubectl drain <node> --ignore-daemonsets

when you're done with the maintenance:

  1. run kubectl uncordon <node>

You might want to delete some pods running on the remaining nodes so they get restarted and spread more evenly across the cluster. Alternatively, you can just wait for routine updates to restart pods over time.
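One way to re-spread the workload after maintenance is a rolling restart, which recreates pods and lets the scheduler place them across all available nodes again (this assumes the workloads are Deployments):

```shell
# Restart one deployment...
kubectl rollout restart deployment <name>
# ...or every deployment in a namespace:
kubectl rollout restart deployment -n <namespace>
```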

Removing a node

  1. Drain the node
     kubectl drain <node> --ignore-daemonsets
  2. Delete the node record from Kubernetes
     kubectl delete node <node>
  3. Probably delete the VM or something - it's done now

Upgrading Kubernetes

Full instructions here: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ READ THEM!

It’s perhaps worth ignoring the 1.x.0 releases, since experience suggests things like MetalLB and Calico might not yet support them in a stable version, which is a recipe for pain.

  1. On the master node, run apt-cache madison kubeadm to find a version to update to
  2. On the master node, run:
    sudo apt-get install -y --allow-change-held-packages kubeadm=<your-chosen-version>
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v<your-chosen-version>
    sudo apt-get install -y --allow-change-held-packages kubelet=<your-chosen-version> kubectl=<your-chosen-version>
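The --allow-change-held-packages flag above implies the kube* packages are on hold; if they aren't already, it's worth pinning them after the upgrade so a routine apt upgrade can't bump Kubernetes to an untested version:

```shell
sudo apt-mark hold kubeadm kubelet kubectl
```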


On each node:

  1. Drain the node
  2. Run:
    sudo apt-get install -y --allow-change-held-packages kubeadm=<your-chosen-version>
    sudo kubeadm upgrade node
    sudo apt-get install -y --allow-change-held-packages kubelet=<your-chosen-version> kubectl=<your-chosen-version>
  3. run kubectl get nodes wherever you normally run kubectl to make sure the node is running the expected version
  4. Uncordon node

Fixing Screwups

Re-adding a node you removed by mistake

If you accidentally run

kubectl delete node <node>

when you didn't mean to, don't panic - the workload should be shifted automatically to a remaining node. Here's how to re-add the one you just removed:

  1. Run kubeadm reset on the affected node
  2. Run kubeadm join as if it was a new node
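On the affected node, that looks something like this, with the token and hash obtained from the master as described under "Adding a node":

```shell
# Tear down the node's local kubelet/kubeadm state.
sudo kubeadm reset
# Re-join the cluster as if it were a new node.
sudo kubeadm join --token <token> kube-master.lan.london.hackspace.org.uk:6443 \
  --discovery-token-ca-cert-hash sha256:<hash>
```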