feat: update blog

Signed-off-by: kjuulh <contact@kjuulh.io>
2026-01-23 23:01:57 +01:00
parent 708ebad9ed
commit 664dffeb19


@@ -3,90 +3,123 @@
type: blog-post
title: "Tales of the Homelab I: Moving is fun"
description:
draft: false
date: 2026-01-23
updates:
  - time: 2026-01-23
    description: first iteration
tags:
  - "#blog"
  - "#homelab"
---
I love my homelab. It is an amalgamation of random machines, both efficient and not, hosted and not, pretty janky overall. A homelab reflects a lot about what kind of operator you are. It's a hobby, and we all come from different backgrounds with different interests.
Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, kid myself into believing it's cheaper (it isn't; electricity and hardware ain't cheap, y'all), and I like to just build stuff, if that wasn't apparent from the previous post.
A homelab is a term that isn't clearly defined. To me, it's basically the meme:
> Web: here is the cloud
> Hobbyist: cloud at home
It can be anything from a Raspberry Pi, to an old Lenovo ThinkPad, to a full-scale rack with enterprise gear, and often several of those states exist at the same time.
My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear, etc. I basically have two sides to my homelab. One is my media / home-related stuff; the other is my software brain, with PCs running Kubernetes, Docker, this blog, and so on.
It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (basically a virtual machine hypervisor; a datacenter at home). It runs:
* Home Assistant, where it all started (I needed an upgrade from running it on a Raspberry Pi)
* MinIO (S3 server)
* Vault (secrets provider)
* Drone (CI runner)
* Harbor...
* Renova...
* Zitadel...
* Todo...
* Blo...
* Gi...
* P...
In total: **19 VMs**.
You might be saying, and I don't want to hear it, that this is simply too many. A big, glaring single point of failure. Foreshadowing, right there.
My other nodes run highly available Kubernetes with replicated storage and so on. They do, however, depend on the central node for database and secrets.
## Moving
So, I was moving, and a little bit stressed because I was starting a new job at the same time (day, idiot). I basically packed everything into boxes / the back of my car and moved it.
It took about a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center, filled with legally procured media, I assure you.
I didn't think too much of it. Plugged it in on top of a kitchen counter, heard it spin up... and nothing came online. I've got monitoring for all my services, and none of it resolved. Curious.
I grabbed a spare screen and plugged it in.
```bash
systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
```
Hmm. Very much *hmm*. Smells like hardware failure, but no panic yet.
I had an extra SSD in the box, the one used for all the VM volumes. I'd noticed it had been a little loose before, but it hadn't been a problem. The enclosure is meant for a full HDD, not a smaller SSD.
I tried reseating the SSD. No luck.
Slightly panicky now, I found another PC and plugged the SSD into that to check whether it was just the internal connector.
Nope. Nope. Dead SSD. Absolutely dead.
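For reference, checking a suspect drive from another Linux machine looks roughly like this; device names are placeholders and `smartmontools` is assumed to be installed:
```bash
# Does the kernel even see the drive once it is plugged into another machine?
lsblk -o NAME,SIZE,MODEL,SERIAL

# Watch for link resets / I/O errors as the drive is connected
sudo dmesg --follow | grep -iE "ata|sd[a-z]|error"

# Ask the drive for its SMART health report (replace sdX with the real device)
sudo smartctl -a /dev/sdX
```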
The box wouldn't boot without the ZFS pool, so I needed a way to stop that from happening. Using a live-boot Linux USB, I could disable the ZFS import and reboot.
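Roughly the shape of that rescue, as a sketch rather than the exact commands I ran; it assumes a Debian-style live USB, a root filesystem on a regular partition (not ZFS-on-root), and device names that will differ on your machine:
```bash
# From the live USB: mount the box's root filesystem and chroot into it
# (partition is a placeholder; check `lsblk` first)
sudo mount /dev/nvme0n1p2 /mnt
for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt

# Stop systemd from trying to import the dead pool on the next boot
systemctl mask zfs-import-cache.service zfs-import-scan.service

# Back out and reboot into the (now pool-less) system
exit
sudo reboot
```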
The Proxmox UI, however, was a bloodbath.
**0/19 VMs running.**
F@ck.
As it turns out, there's sometimes a reason we do the contingencies we do professionally: high-availability setups, 3-2-1 backup strategies, etc. Even though my services had enjoyed ~99% uptime until then, the single point of failure struck, leaving a lot of damage.
The way I had "designed" my VM installations was by using a separate boot drive and volume drive. This is a feature of KVM / Proxmox that allows sharing a base OS boot disk while separating the actual data. It's quite convenient and keeps VMs slim.
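As a rough illustration of that layout (the VM ID and storage names here are placeholders, not my actual config): each VM gets a small boot disk on one storage and a separate data volume on another.
```bash
# Hypothetical VM 101: small boot disk on one storage, data volume on another
qm set 101 --scsi0 boot-store:20     # base OS / boot disk (20 GB, newly allocated)
qm set 101 --scsi1 data-store:100    # the actual application data (100 GB)
qm set 101 --boot order=scsi0

# Which ends up in /etc/pve/qemu-server/101.conf roughly as:
#   scsi0: boot-store:vm-101-disk-0,size=20G
#   scsi1: data-store:vm-101-disk-1,size=100G
#   boot: order=scsi0
```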
My Debian base image was about 20 GB allocated. That would've been 20 GB × 19 VMs, roughly 380 GB. Not terrible, and honestly, I would've paid that cost if I'd been paying attention.
Instead, I was left with VMs that wouldn't boot because their boot disk was gone. Like a head without a body. [A dog without a bone](https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV).
After a brief panic (actually quite brief), I checked what mattered first: backups. And yes, the important things (code in Gitea, family data) were all backed up and available. I should've tested my contingencies better, but at least monitoring worked.
I restored the most important services on one of my old workstations that I use for development.
I *did* have backups of the VMs... but they were backed up to the same extra drive that had failed.
That was dumb.
However, I had a theory. I could replace the missing boot disks with new ones and reattach them to the existing VM data disks. Basically, give the dog its bone back.
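In Proxmox terms, the theory looks roughly like this sketch; the VM ID, storage names, and volume name are placeholders:
```bash
# Hypothetical VM 101: allocate a fresh boot disk to replace the lost one,
# re-attach the surviving data volume, and boot from the new disk.
qm set 101 --scsi0 boot-store:20              # new, empty boot disk; base OS gets reinstalled onto it
qm set 101 --scsi1 data-store:vm-101-disk-1   # existing data volume that survived the failure
qm set 101 --boot order=scsi0
qm start 101
```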
It was not fun, but I managed to restore Matrix, Home Assistant, this blog, Drone, PostgreSQL, and Gitea. Those were the ones I cared about most and that were actually recoverable. The rest had their data living exclusively on the dead disk.
I may or may not share how I fixed it. It's been a while, and I'd have to reconstruct all the steps. So probably not.
At this point, my Kubernetes cluster was basically *borked* (if you know, you know). All the data was there, but none of the services worked; most of them depended on secrets from Vault, which was gone.
So I had to start over. Pretty much.
It wasn't a huge loss, though. All my data lived in Postgres backups, and all configuration was stored GitOps-style in Gitea.
## Postmortem
I never fully restored all the VMs, and that's fine. I *could* have, but this was also a good opportunity to improve my setup and finally move more things into highly available compute. It was also a chance to replace components I wasn't happy with. Basically the eternal cycle of a homelab.
Harbor was one of them. It's heavy and fragile. Basically, all my Java services had to go. Not because I hate Java, but because they're often far too resource-intensive for a homelab running on mini PCs. I can't have services consuming all the RAM and CPU for very little benefit.
Since then, I've significantly improved my backup setup. I now use proper mirrored RAID setups on my workstations for both workloads and backups, plus an offsite backup. The moving parts, sketched below:
* ZFS with zrepl
* Borgmatic / BorgBackup for offsite
* PostgreSQL incremental backups with pgBackRest
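A hedged, day-to-day view of that stack; the job name, repository paths, and stanza below are placeholders rather than my actual configuration:
```bash
# zrepl: ZFS snapshots + replication between machines
zrepl status                                # live view of snapshot/replication jobs
zrepl signal wakeup workstation_to_backup   # kick a job by name (placeholder name)

# borgmatic / BorgBackup: encrypted, deduplicated offsite backups
borgmatic create --stats                    # run the backup defined in /etc/borgmatic/config.yaml
borgmatic list                              # list archives in the configured repositories

# pgBackRest: full + incremental PostgreSQL backups
pgbackrest --stanza=main backup --type=incr # incremental backup (stanza name is a placeholder)
pgbackrest --stanza=main info               # check what's actually in the backup repo
```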
Everything is monitored. I also replaced five different Grafana services with a single monitoring platform built on OpenTelemetry and SigNoz. It works well, though replacing PromQL with SQL definitely has some growing pains.
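The wiring for that is mostly standard OpenTelemetry: each service points its OTLP exporter at the collector in front of SigNoz. A minimal sketch, assuming a self-hosted SigNoz listening on the usual OTLP ports (the hostname is a placeholder):
```bash
# Standard OpenTelemetry SDK environment variables; an instrumented service
# picks these up and ships its telemetry to the collector in front of SigNoz.
export OTEL_SERVICE_NAME="blog"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://signoz.internal:4317"  # OTLP gRPC (use 4318 for HTTP)
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
./my-service   # placeholder for however the service is actually started
```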
In the next post, I'll probably share how I do compute (Kubernetes from home), and maybe another homelab oops, like the time I nearly lost all my family's Christmas wishes 😉
I swear I'm a professional. But we all make mistakes sometimes. What matters is learning from them and fixing problems even when they seem impossible.
Have a great Friday, and I hope to see you in the next post.