diff --git a/content/posts/2026-01-23-tales-of-the-homelab.md b/content/posts/2026-01-23-tales-of-the-homelab.md
new file mode 100644
index 0000000..cebe17e
--- /dev/null
+++ b/content/posts/2026-01-23-tales-of-the-homelab.md
@@ -0,0 +1,92 @@
+---
+type: blog-post
+title: "Tales of the Homelab I: Moving is fun"
+description:
+draft: false
+date: 2026-01-20
+updates:
+  - time: 2026-01-20
+    description: first iteration
+tags:
+  - "#blog"
+  - "#rust"
+  - "#homelab"
+---
+
+I love my homelab. It is an amalgam of random machines, both efficient and not, hosted and not, all pretty janky though. A homelab reflects a lot about what kind of an operator you are. A homelab is a hobby, and all of us come from different backgrounds with various interests.
+
+Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, and kid myself into believing it is cheaper (it isn't, electricity and hardware ain't cheap y'all), and I like to just build stuff, if that wasn't apparent from the previous post.
+
+Homelab is a term that isn't clearly defined; to me it is basically the meme.
+
+> Web: here is the cloud,
+> Hobbyist: Cloud at home.
+
+It can be anything from a Raspberry Pi or an old Lenovo ThinkPad to a full-scale rack with enterprise gear. Often with the two states existing at the same time.
+
+My homelab is definitely in that state: various Raspberry Pis, minipcs, old workstations, network gear etc. I basically have two sides to my homelab. One is my media / home related stuff, the other is my software brain, with pcs running Kubernetes, Docker, this blog etc.
+
+It all started with one of my minipcs. It has a few nvme drives and runs Proxmox (basically a virtual machine hypervisor, data center at home), which hosts:
+
+- Home Assistant, where it all started: I needed an upgrade from running it on a Raspberry Pi
+- MinIO (S3 server)
+- Vault (secrets provider)
+- Drone (ci runner)
+- Harbor ...
+- Renova...
+- Zitad...
+- Todo...
+- Blo...
+- Gi...
+- P...
+
+In total 19 vms. You might be saying that is simply too many, and I don't want to hear it. A big glaring single point of failure, foreshadowing for ya right there.
+
+My other nodes run highly available Kubernetes, with replicated storage and so on. They depend on the central node for database and secrets, however.
+
+Sooo, I was moving and a little bit stressed because I was starting a new job at the same time, so I basically packed everything in a box / the back of my car, and moved it.
+
+It took a week before I got around to setting up my central minipc again, as I simply began to miss my Jellyfin media center, all filled with legally procured media I assure you.
+
+I didn't think too much of it. I plugged it in on top of a kitchen counter, heard it spin up, and nothing came online. I've got monitoring for all my services and none of them recovered, curious. I grabbed a spare screen and plugged it in, curious.
+
+```bash
+systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
+```
+
+Hmm, very much hmm. Smells of hardware failure, no panic.
+
+I had an extra ssd in the box that I used for all the volumes for the vms. I'd noticed it had been a little loose, but it hadn't been a problem before; the enclosure is meant for a full hdd, not a smaller ssd.
+
+Next I tried to reseat the ssd. No luck. Slightly panicky, I found one of my other pcs and tried to plug in the ssd to see if it was just the internal connector that was broken.
+Nope! Nope! Dead SSD, absolutely dead.
+
+The box wouldn't boot without the zfs pool, so next I needed a way to stop that from happening. Using the Proxmox console I could get into it and disable the zfs import, and could then reboot. The Proxmox UI however was a bloodbath. 0/19 vms running. F@ck.
+
+As it turns out, there is a reason why we do the contingencies we do professionally: highly available installations, 3-2-1 backup strategies etc. Even though my services had an uptime of 99% up until then, the single point of failure struck, leaving me with a lot of damage.
+
+As it turns out, the way I had "designed" my vm installations was using a separate boot drive and volume drive. This is a feature of KVM / Proxmox that allows sharing a base os and boot drive and separating the actual data. This is quite convenient as it allows vms to be slimmer, since you don't have to pay for each base os. My Debian base was about 20GB allocated, so that would've been 20 * 19 = 380GB. Not too bad, and honestly I would've paid that cost, if I'd paid attention.
+
+So that left me with vms that wouldn't boot, because the boot disk was gone. Like a head without a body, dog without a bone (https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV), you get it.
+
+There was a brief moment of panic, and actually it was quite brief, because all my "data" had been backed up. That was my first priority to check, and yep, what I cared about (code on Gitea, and my family's data) was all backed up and still available. I should've tested my contingencies more, but I am glad I had monitoring for them, though my restoration process could've been better. I restored these on one of my old workstations that I use for development, and brought back my most important services, files and code.
+
+I did have backups of the vms, buuut, they were backed up to the extra drive, the one that failed. That was dumb...
+
+However, I had a theory that I could fix it. I basically had to replace the boot partition for the vms with a new one, and then retarget the vms to point to the new boot drive. Basically giving the dog the bone back.
+
+It was not fun, but I did manage to restore Matrix, Home Assistant, blog, Drone, PostgreSQL and Gitea. These were pretty much the ones I cared about the most that were recoverable. The rest had their data on the extra disk as well.
+
+I may or may not share how I actually fixed it, but it has been a while and I would basically have to redo all the steps again. So probably not.
+
+So yeah, my Kubernetes cluster was basically borked (if you know you know). I still had all my data, but none of the services worked, because most of them rely on secrets from Vault, which was gone. So yeah, I had to start over, pretty much. It wasn't a big loss though, all my data was backed up in postgres, and all my configuration in a gitops architecture in Gitea.
+
+## Postmortem
+
+To be honest, I never quite got all of the vms working. This is fine, I could've gotten them working again, but this was also a chance to improve my setup and finally move some of my things into highly available compute. And replace some components I wasn't happy with. Harbor being one; so heavy to run, and fragile. Basically all my Java services had to go. Not because I hate Java necessarily, but because they're often far too resource intensive for my homelab, it is running on minipcs after all. I can't have it taking up all the ram and cpu for pretty much nothing.
+
+I've since improved my backup setup dramatically. I now use a proper mirrored and raid setup on my workstations for both the main workloads and backups, as well as an offsite backup. I use zfs with zrepl, borgmatic / borgbackup for the offsite, and pgbackrest for incremental postgres backups. All are still monitored, now using a new monitoring platform built upon OpenTelemetry and SigNoz. I replaced the 5 different Grafana services with SigNoz and OpenTelemetry. It works fine, but there are definitely some growing pains in replacing promql with sql.
+
+Probably in the next post I'll share how I do compute, Kubernetes from home, and potentially my other homelab oops, nearly losing all my family's wishes for Christmas ;) I swear I am a professional, but we all sometimes make mistakes. It is important to learn from them, and to fix problems even if they seem impossible to resolve.
+
+Have a great Friday, and I hope to see you in the next post.