diff --git a/content/posts/2026-01-23-tales-of-the-homelab.md b/content/posts/2026-01-23-tales-of-the-homelab.md
index cebe17e..709ba30 100644
--- a/content/posts/2026-01-23-tales-of-the-homelab.md
+++ b/content/posts/2026-01-23-tales-of-the-homelab.md
@@ -3,90 +3,123 @@
 type: blog-post
 title: "Tales of the Homelab I: Moving is fun"
 description:
 draft: false
-date: 2026-01-20
+date: 2026-01-23
 updates:
-  - time: 2026-01-20
+  - time: 2026-01-23
     description: first iteration
 tags:
   - "#blog"
-  - "#rust"
   - "#homelab"
 ---

-I love my homelab, it is an amalgam of random machines, both effecient and not, hosted and not, all pretty janky though. A homelab reflects a lot about what kind of an operator you are. A homelab is a hobby, and all of us come from different backgrounds with various interests.
+I love my homelab. It is an amalgamation of random machines, both efficient and not, hosted and not, pretty janky overall. A homelab reflects a lot about what kind of operator you are. It’s a hobby, and we all come from different backgrounds with different interests.

-Some like to replace applications when google kills them, some like to tinker and nerd about performance, others like to build applications. I like to own my data, and kid myself into believing it is cheaper (it isn't electricity and hardware ain't cheap y'all), and I like to just build stuff, if it wasn't apprarent in the previous post.
+Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, kid myself into believing it’s cheaper (it isn’t; electricity and hardware ain’t cheap, y’all), and I like to just build stuff, if that wasn’t apparent from the previous post.

-A homelab is a term that isn't clearly defined, to me it is basically the meme.
+A homelab is a term that isn’t clearly defined. To me, it’s basically the meme:

-> Web: here is the cloud,
-> Hobbyist: Cloud at home.
+> Web: here is the cloud
+> Hobbyist: cloud at home

-It can be anything from a raspberry pi, an old Lenovo ThinkPad, to a full-scale rack with enterprise gear etc. Often with the two states existing at the same time.
+It can be anything from a Raspberry Pi, to an old Lenovo ThinkPad, to a full-scale rack with enterprise gear, and often several of those states exist at the same time.

-My homelab is definitely in that state, various raspberry pis, minipcs, old workstations, network gear etc. I basically have two sides to my homelab, one is my media / home related stuff, the other is my software brain, with pcs running kubernetes, docker, this blog etc.
+My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear, etc. I basically have two sides to my homelab. One is my media / home-related stuff; the other is my software brain, with PCs running Kubernetes, Docker, this blog, and so on.

-It all started with one of my minipcs, it has a few nvme drives, runs proxmox (basically a virtual machine hypervisor, data center at home), it runs:
+It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (basically a virtual machine hypervisor, a datacenter at home). It runs:

-- Home assistant where it all started, I needed an upgrade from running it on a raspberry pi
-- Minio (s3 server)
-- Vault (secrets provider)
-- Drone (ci runner)
-- Harbor ...
-- Renova...
-- Zitad...
-- Todo...
-- Blo...
-- Gi...
-- P...
+* Home Assistant, where it all started; I needed an upgrade from running it on a Raspberry Pi
+* MinIO (S3 server)
+* Vault (secrets provider)
+* Drone (CI runner)
+* Harbor...
+* Renova...
+* Zitadel...
+* Todo...
+* Blo...
+* Gi...
+* P...

-In total 19 vms. You might be saying, and I don't want to hear it. That is simply too many. A big glaring single point of failure, foreshadowing for ya right there.
+In total: **19 VMs**.

-My other nodes run highly available kubernetes, with replicated storage and so on. It depends on the central node however for database and secrets.
+You might be saying (and I don’t want to hear it) that this is simply too many. A big, glaring single point of failure. Foreshadowing, right there.

-Sooo, I was moving and little bit stressed because I was starting new work at the same time, so I basically packaged everything in a box / back of my car, and moved it.
+My other nodes run highly available Kubernetes with replicated storage and so on. They do, however, depend on the central node for database and secrets.
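+
+To make the foreshadowing concrete: even the "highly available" side quietly pointed at that one box for the important bits. The names below are made up, a sketch of the shape of the dependency rather than my actual manifests:
+
+```bash
+# A common pattern: an ExternalName service pointing in-cluster apps at
+# Vault running outside the cluster, i.e. on the central mini PC
+kubectl -n vault get svc external-vault -o wide
+
+# Same story for the database: the endpoints behind "postgres" resolve
+# to the central node's address
+kubectl -n db get endpoints postgres
+```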

-It took a week before I got around to setting up my central minipc again, as I simply began to miss my jellyfin media center, all filled with legally procured media I assure you.
+## Moving

-I didn't think too much of it, plug it in on top of a kitchen counter, and hearing it spin up, and nothing came online. I've got monitoring for all my services and none was resolved, curious. I grabbed a spare screen and plugged it in, curious.
+So, I was moving, and a little bit stressed because I was starting a new job at the same time (the same day, idiot). I basically packed everything into boxes / the back of my car and moved it.
+
+It took about a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center, filled with legally procured media, I assure you.
+
+I didn’t think too much of it. Plugged it in on top of a kitchen counter, heard it spin up... and nothing came online. I’ve got monitoring for all my services, and none of them resolved. Curious.
+
+I grabbed a spare screen and plugged it in.

 ```bash
 systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
 ```

-Hmm, very much hmm. Smells of hardware failure, no panic.
+Hmm. Very much *hmm*. Smells like hardware failure, but no panic yet.

-I had an extra ssd in the box I used for all the volumes for the vms. It had been a little loose I'd noticed, but it hasn't been a problem before, the enclosure is meant for a full hdd, not a smaller ssd.
+I had an extra SSD in the box, the one used for all the VM volumes. I’d noticed it had been a little loose before, but it hadn’t been a problem. The enclosure is meant for a full HDD, not a smaller SSD.

-Next I tried to reseat the ssd. No luck. Slightly panicky I found one of my other pcs, and tried to plug in the ssd to see if it was just the internal connector that was broken.
-Nope! Nope! Dead SSD, absolutely dead.
+I tried reseating the SSD. No luck.

-The box wouldn't boot without the zfs-pool, so next I needed a way to stop that from happening, using the Proxmox console I could get into it, and disable the zfs import, and could then reboot. The proxmox UI however was a bloodbath. 0/19 vms running. F@ck.
+Slightly panicky now, I found another PC and plugged the SSD into that to check whether it was just the internal connector.

-As it turns out there is sometimes a reason why we do the contingencies we do professionaly, highly available installations with 3-2-1 backup strategies etc. Even though my services had an uptime of 99% up until then, the single point of failure struct, leaving me with a lot of damage.
+Nope. Nope. Dead SSD. Absolutely dead.

-As it turns out the way I had "designed" my vm installations was using a separate boot-drive and volume drive. This is a feature of KVM / Proxmox and allows sharing a base os and boot drive and separating the actual data. This is quite convenient as it allows a vms to be more slim as you don't have to pay for each base os. My debian base was about 20GB allocated, so that would've been 20 * 19. Not too bad, and honestly I would've paid that cost, if I'd paid attention.
+The box wouldn’t boot without the ZFS pool, so I needed a way to stop that from happening. Using a live-boot Linux USB, I could disable the ZFS import and reboot.
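+
+The "disable the ZFS import" part boils down to something like this; a sketch, assuming the root filesystem itself is not on ZFS, with placeholder device paths:
+
+```bash
+# From the live session: mount the installed root and chroot into it
+sudo mount /dev/nvme0n1p2 /mnt
+for d in dev proc sys; do sudo mount --rbind /$d /mnt/$d; done
+sudo chroot /mnt
+
+# Then, inside the chroot: zfs-import-cache.service replays
+# /etc/zfs/zpool.cache at boot, so removing the cachefile makes it
+# forget the dead pool (re-import healthy pools afterwards to rebuild it)
+rm /etc/zfs/zpool.cache
+
+# Or skip cache-based import entirely
+systemctl mask zfs-import-cache.service
+```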

-So that left me with vms that wouldn't boot, because the boot disk was gone. Like a head without a body, dog without a bone (https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV), you get it.
+The Proxmox UI, however, was a bloodbath.

-After a brief moment of panic, and actually it was quite brief, because all my "data" had been backed up, so that was my first priority to check, and yep, what I cared about (code on gitea, and my family's data was all backed up and available still). I should've tested my contingencies more but I am glad I had both monitoring for it, though my restoration processes could've been better. I restored these on one my for my old workstations that I use for development and restored my most important services, files and code.
+**0/19 VMs running.**
+F@ck.

-I did have backups of the vms, buuut, they were backed up to the extra drive, which had the failure. That was dumb...
+As it turns out, there’s sometimes a reason we do the contingencies we do professionally: high-availability setups, 3-2-1 backup strategies, etc. Even though my services had enjoyed ~99% uptime until then, the single point of failure struck, leaving a lot of damage.

-However, I had a theory that I could fix it, I basically had to replace the boot partition for the vms with a new one, and then retarget the boot drive to point to the new boot drive. Basically giving the dog the bone back.
+The way I had "designed" my VM installations was by using a separate boot drive and volume drive. This is a feature of KVM / Proxmox and allows sharing a base OS boot disk while separating actual data. It’s quite convenient and keeps VMs slim.

-It was not fun, but I did manage to restore matrix, home assistant, blog, drone, postgresql and gitea. These were pretty much the ones I cared about the most that was recoverable. The rest had their data also on the extra disk.
+My Debian base image was about 20 GB. That would’ve been 20 GB × 19 VMs. Not terrible, and honestly, I would’ve paid that cost if I’d been paying attention.

-I may or may not share how I actually fixed it, but it has been a while and I would have to basically redo all the steps again. So probably not.
+Instead, I was left with VMs that wouldn’t boot because their boot disk was gone. Like a head without a body. [A dog without a bone](https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV).

-So yeah, my kubernetes cluster was basically borked (if you know you know), I still had all my data, but none of the services worked, because most of them relies on secrets from vault, which was gone. So yeah, I had to start over, pretty much. It wasn't a big loss though, all my data was backed up in postgres, and all my configuration in a gitops architecture in gitea.
+After a brief panic (actually quite brief), I checked what mattered first: backups. And yes, the important things (code in Gitea, family data) were all backed up and available. I should’ve tested my contingencies better, but at least monitoring worked.
+
+I restored the most important services on one of my old workstations that I use for development.
+
+I *did* have backups of the VMs... but they were backed up to the same extra drive that had failed.
+
+That was dumb.
+
+However, I had a theory. I could replace the missing boot disks with new ones and reattach them to the existing VM data disks. Basically, give the dog its bone back.
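+
+Per VM, the Proxmox side of that theory is roughly the following; a sketch, where the VM ID, storage name, and image path are made up:
+
+```bash
+# Import a fresh Debian base image as a new volume for VM 101
+qm importdisk 101 debian-base.qcow2 local-zfs
+
+# Attach the imported volume as the boot disk...
+qm set 101 --scsi0 local-zfs:vm-101-disk-1
+
+# ...and make sure the VM actually boots from it
+qm set 101 --boot order=scsi0
+
+# The surviving data disk stays attached as before; sanity-check with
+qm config 101
+```
+
+Multiply that by a double-digit number of VMs, plus whatever each guest expected to find on its old boot disk, and you can guess the mood.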

+It was not fun, but I managed to restore Matrix, Home Assistant, this blog, Drone, PostgreSQL, and Gitea. Those were the ones I cared about most and that were actually recoverable. The rest had their data living exclusively on the dead disk.
+
+I may or may not share how I fixed it. It’s been a while, and I’d have to reconstruct all the steps. So probably not.
+
+At this point, my Kubernetes cluster was basically *borked* (if you know, you know). All the data was there, but none of the services worked; most of them depended on secrets from Vault, which was gone.
+
+So I had to start over. Pretty much.
+
+It wasn’t a huge loss, though. All my data lived in Postgres backups, and all configuration was stored GitOps-style in Gitea.

 ## Postmortem

-To be honest, I never quite got all of vms working, this is fine, I could've gotten it working again, but this was also a chance to improve my setup and finally move some of my things into highly available compute. And replace some components I wasn't happy with. Harbor being one; so heavy to run, and fragile. Basically all my java services had to go. Not because I hate java necessarily, but because they're often far too resource intensive for my homelab, it is running on minipcs after all. I can't have it taking up all the ram, and cpu for pretty much nothing.
+I never fully restored all the VMs, and that’s fine. I *could* have, but this was also a good opportunity to improve my setup and finally move more things into highly available compute. It was also a chance to replace components I wasn’t happy with. Basically the eternal cycle of a homelab.

-I've since improved my backup setup dramatically. I now use a proper mirrored and raid setup on my workstations for both the main workloads and backups, as well as an offsite backup. Using zfs with zrepl, borgmatic / borgbackup for the offsite, postgres has incremental backups with pgbackrest. All are still monitored, now using a new monitoring platform built upon open telemetry and signoz. I replaced the 5 different grafana services with signoz and open telemetry. It works fine, but there is definitely some growing pains in replacing promql with sql.
+Harbor was one of them. It’s heavy and fragile. Basically, all my Java services had to go. Not because I hate Java, but because they’re often far too resource-intensive for a homelab running on mini PCs. I can’t have services consuming all RAM and CPU for very little benefit.

-Probably in the next post I'll share how I do compute, kubernetes from home, and potentially my other homelab oops, nearly losing all my family's wishes for christmas ;) I swear I am a professional, but we all sometimes make mistakes, it is important to learn from them, and fix problems even if they seem impossible to resolve.
+Since then, I’ve significantly improved my backup setup. I now use proper mirrored RAID setups on my workstations for both workloads and backups, plus an offsite backup. The stack looks like this (sketch below):

-Have a great friday, and I hope to see you in the next post.
+* ZFS with zrepl
+* Borgmatic / BorgBackup for offsite
+* PostgreSQL incremental backups with pgBackRest
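+
+And roughly what driving each layer looks like; the stanza and option values here are placeholders, not my actual config:
+
+```bash
+# ZFS replication: zrepl runs as a daemon from its own config file;
+# this just shows the current replication state
+zrepl status
+
+# Offsite: borgmatic wraps BorgBackup, driven by its YAML config
+borgmatic --verbosity 1
+
+# Postgres: a pgBackRest incremental backup against a stanza
+pgbackrest --stanza=main --type=incr backup
+```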

+Everything is monitored. I also replaced five different Grafana services with a single monitoring platform built on OpenTelemetry and SigNoz. It works well, though replacing PromQL with SQL definitely has some growing pains.
+
+In the next post, I’ll probably share how I do compute (Kubernetes from home), and maybe another homelab oops, like the time I nearly lost all my family’s Christmas wishes 😉
+
+I swear I’m a professional. But we all make mistakes sometimes. What matters is learning from them and fixing problems even when they seem impossible.
+
+Have a great Friday, and I hope to see you in the next post.