---
type: blog-post
title: "Tales of the Homelab I: Moving is fun"
description:
draft: false
date: 2026-01-20
updates:
- time: 2026-01-20
description: first iteration
tags:
- "#blog"
- "#rust"
- "#homelab"
---
I love my homelab. It is an amalgam of random machines, efficient and not, hosted and not, all pretty janky though. A homelab reflects a lot about what kind of operator you are. A homelab is a hobby, and all of us come from different backgrounds with various interests.
Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, to kid myself into believing it is cheaper (it isn't; electricity and hardware ain't cheap, y'all), and to just build stuff, if that wasn't apparent from the previous post.
"Homelab" is a term that isn't clearly defined; to me it is basically the meme:
> Web: here is the cloud,
> Hobbyist: Cloud at home.
It can be anything from a Raspberry Pi or an old Lenovo ThinkPad to a full-scale rack with enterprise gear, often with both ends of that spectrum existing at the same time.
My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear and so on. There are basically two sides to it: one is my media / home related stuff, the other is my software brain, with PCs running Kubernetes, Docker, this blog, etc.
It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (basically a virtual machine hypervisor; a data center at home), which hosts:
- Home Assistant, where it all started; I needed an upgrade from running it on a Raspberry Pi
- MinIO (S3 server)
- Vault (secrets provider)
- Drone (CI runner)
- Harbor ...
- Renova...
- Zitad...
- Todo...
- Blo...
- Gi...
- P...
In total, 19 VMs. You might be saying, and I don't want to hear it, that that is simply too many. A big, glaring single point of failure; foreshadowing for ya right there.
My other nodes run highly available Kubernetes with replicated storage and so on. They depend on the central node, however, for databases and secrets.
Sooo, I was moving, and a little bit stressed because I was starting a new job at the same time, so I basically packed everything into a box in the back of my car and moved it.
It took a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center, filled with legally procured media, I assure you.
I didn't think too much of it: plugged it in on top of a kitchen counter, heard it spin up, and... nothing came online. I've got monitoring for all my services, and none of them were coming back. Curious. I grabbed a spare screen and plugged it in.
```bash
systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
```
Hmm, very much hmm. Smells of hardware failure, no panic.
I had an extra SSD in the box, which I used for all the volumes for the VMs. It had been a little loose, I'd noticed, but it hadn't been a problem before; the enclosure is meant for a full-size HDD, not a smaller SSD.
Next I tried to reseat the SSD. No luck. Slightly panicky, I found one of my other PCs and plugged the SSD into it, to see if it was just the internal connector that was broken.
Nope! Nope! Dead SSD, absolutely dead.
The box wouldn't boot without the ZFS pool, so next I needed a way to stop that from happening. Using the Proxmox console I could get into it, disable the ZFS import, and then reboot. The Proxmox UI, however, was a bloodbath: 0/19 VMs running. F@ck.
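I never wrote down the exact incantation, but on a stock Proxmox (Debian) install, "disabling the ZFS import" boils down to masking the systemd import units so the host stops waiting on the dead pool. A rough sketch, assuming the standard unit names (your units, and certainly your pool, may differ):
```bash
# From the Proxmox host console: stop zfs from trying to import pools at boot.
systemctl list-dependencies zfs-import.target  # see which import units are wired in
systemctl mask zfs-import-cache.service        # skip importing pools from the cachefile
systemctl mask zfs-import-scan.service         # skip scanning devices for pools
reboot                                         # host now boots without hanging on the dead pool
```
Masking (rather than just disabling) also stops the units from being pulled back in as dependencies of `zfs-import.target`.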
As it turns out, there is sometimes a reason we build the contingencies we do professionally: highly available installations, 3-2-1 backup strategies and so on. Even though my services had an uptime of 99% up until then, the single point of failure struck, leaving me with a lot of damage.
As it turns out, the way I had "designed" my VM installations was with a separate boot drive and volume drive. This is a feature of KVM / Proxmox that allows sharing a base OS / boot drive while separating out the actual data. It is quite convenient, as it keeps the VMs slim: you don't have to pay the disk cost of a full base OS per VM. My Debian base was about 20GB allocated, so that would've been 20GB * 19. Not too bad, and honestly I would've paid that cost, if I'd paid attention.
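For the curious, a VM's disk layout in Proxmox lives in `/etc/pve/qemu-server/<vmid>.conf`. A minimal illustrative sketch of the split (the vmid, storage names and sizes here are made up, not my actual config):
```ini
# /etc/pve/qemu-server/101.conf (illustrative)
boot: order=scsi0
scsi0: extra-ssd:vm-101-disk-0,size=20G    # debian base / boot disk (this storage died)
scsi1: local-zfs:vm-101-disk-1,size=100G   # the vm's actual data volume (survived)
```
Lose the storage behind `scsi0` and the VM has no OS to boot from, even though its data on `scsi1` may be perfectly fine.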
So that left me with VMs that wouldn't boot, because the boot disk was gone. Like a head without a body, a dog without a bone (https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV), you get it.
After a brief moment of panic (and it actually was quite brief), my first priority was to check that all my "data" had been backed up, and yep: what I cared about (code on Gitea, and my family's data) was all backed up and still available. I should've tested my contingencies more, but I am glad I at least had monitoring for them, even though my restoration process could've been better. I restored my most important services, files and code onto one of my old workstations that I use for development.
I did have backups of the VMs, buuut they were backed up to the extra drive, the one that failed. That was dumb...
However, I had a theory that I could fix it: I basically had to build a new boot partition for the VMs, and then retarget each VM's boot drive to point at the new one. Basically giving the dog its bone back.
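A hypothetical sketch of that retargeting with Proxmox's `qm` CLI (the vmid, storage and volume names are made up; this is the shape of the fix, not my actual steps):
```bash
# Attach a freshly built boot disk and point the vm back at its surviving data.
qm set 101 --scsi0 local-zfs:vm-101-disk-0  # new boot disk, rebuilt from a fresh base image
qm set 101 --scsi1 local-zfs:vm-101-disk-1  # reattach the surviving data volume
qm set 101 --boot order=scsi0               # boot from the new disk
qm start 101
```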
It was not fun, but I did manage to restore Matrix, Home Assistant, the blog, Drone, PostgreSQL and Gitea. These were pretty much the ones I cared about most that were recoverable; the rest had their data on the extra disk as well.
I may or may not share how I actually fixed it, but it has been a while and I would basically have to redo all the steps again. So probably not.
So yeah, my Kubernetes cluster was basically borked (if you know, you know). I still had all my data, but none of the services worked, because most of them rely on secrets from Vault, which was gone. So I had to start over, pretty much. It wasn't a big loss though: all my data was backed up in Postgres, and all my configuration lived in a GitOps architecture in Gitea.
## Postmortem
To be honest, I never quite got all of the VMs working again. That is fine; I could have, but this was also a chance to improve my setup, finally move some things onto highly available compute, and replace some components I wasn't happy with. Harbor being one: so heavy to run, and fragile. Basically all my Java services had to go. Not because I hate Java necessarily, but because they're often far too resource intensive for my homelab; it is running on mini PCs after all. I can't have them taking up all the RAM and CPU for pretty much nothing.
I've since improved my backup setup dramatically. I now use a proper mirrored RAID setup on my workstations for both the main workloads and backups, as well as an offsite backup: ZFS with zrepl for replication, borgmatic / BorgBackup for the offsite copy, and incremental Postgres backups with pgBackRest. All of it is still monitored, now on a new monitoring platform built upon OpenTelemetry and SigNoz. I replaced the 5 different Grafana services with SigNoz and OpenTelemetry. It works fine, but there are definitely some growing pains in replacing PromQL with SQL.
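To give a flavor of the offsite piece, a minimal borgmatic config might look something like this (the paths, host and retention here are illustrative, not my actual setup):
```yaml
# /etc/borgmatic/config.yaml -- illustrative offsite job
source_directories:
  - /tank/important
repositories:
  - path: ssh://borg@offsite.example.com/./homelab.borg
    label: offsite
keep_daily: 7
keep_weekly: 4
keep_monthly: 6
```
borgmatic then drives BorgBackup on a timer, while zrepl separately handles ZFS snapshot replication between the local machines.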
In the next post I'll probably share how I do compute, Kubernetes from home, and potentially my other homelab oops: nearly losing all my family's wishes for Christmas ;) I swear I am a professional, but we all make mistakes sometimes; it is important to learn from them, and to fix problems even if they seem impossible to resolve.
Have a great Friday, and I hope to see you in the next post.