feat: update blog

Signed-off-by: kjuulh <contact@kjuulh.io>
2026-01-23 23:01:57 +01:00
parent 708ebad9ed
commit 664dffeb19


@@ -3,90 +3,123 @@
type: blog-post
title: "Tales of the Homelab I: Moving is fun"
description:
draft: false
date: 2026-01-23
updates:
  - time: 2026-01-23
    description: first iteration
tags:
  - "#blog"
  - "#homelab"
---
I love my homelab. It is an amalgamation of random machines, both efficient and not, hosted and not, pretty janky overall. A homelab reflects a lot about what kind of operator you are. It's a hobby, and we all come from different backgrounds with different interests.
Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, kid myself into believing it's cheaper (it isn't; electricity and hardware ain't cheap, y'all), and I like to just build stuff, if that wasn't apparent from the previous post.
A homelab is a term that isn't clearly defined. To me, it's basically the meme:
> Web: here is the cloud
> Hobbyist: cloud at home
It can be anything from a Raspberry Pi, to an old Lenovo ThinkPad, to a full-scale rack with enterprise gear, and often several of those states exist at the same time.
My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear, etc. I basically have two sides to my homelab. One is my media / home-related stuff; the other is my software brain, with PCs running Kubernetes, Docker, this blog, and so on.
It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (basically a virtual machine hypervisor; a datacenter at home). It runs:
* Home Assistant, where it all started (I needed an upgrade from running it on a Raspberry Pi)
* MinIO (S3 server)
* Vault (secrets provider)
* Drone (CI runner)
* Harbor...
* Renova...
* Zitadel...
* Todo...
* Blo...
* Gi...
* P...
In total: **19 VMs**.
You might be saying, and I don't want to hear it, that this is simply too many. A big, glaring single point of failure. Foreshadowing, right there.
My other nodes run highly available Kubernetes with replicated storage and so on. They do, however, depend on the central node for database and secrets.
## Moving
So, I was moving, and a little bit stressed because I was starting a new job at the same time (day, idiot). I basically packed everything into boxes / the back of my car and moved it.
It took about a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center, filled with legally procured media, I assure you.
I didn't think too much of it. Plugged it in on top of a kitchen counter, heard it spin up... and nothing came online. I've got monitoring for all my services, and none of it resolved. Curious.
I grabbed a spare screen and plugged it in.
```bash
systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
```
Hmm. Very much *hmm*. Smells like hardware failure, but no panic yet.
I had an extra SSD in the box, the one used for all the VM volumes. I'd noticed it had been a little loose before, but it hadn't been a problem. The enclosure is meant for a full HDD, not a smaller SSD.
I tried reseating the SSD. No luck.
Slightly panicky now, I found another PC and plugged the SSD into that to check whether it was just the internal connector.
Nope. Nope. Dead SSD. Absolutely dead.
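For reference, checking a suspect drive from another Linux machine looks roughly like this; device names are placeholders and `smartmontools` is assumed to be installed:
```bash
# Does the kernel even see the drive once it is plugged into another machine?
lsblk -o NAME,SIZE,MODEL,SERIAL

# Watch for link resets / I/O errors as the drive is connected
sudo dmesg --follow | grep -iE "ata|sd[a-z]|error"

# Ask the drive for its SMART health report (replace sdX with the real device)
sudo smartctl -a /dev/sdX
```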
The box wouldn't boot without the ZFS pool, so I needed a way to stop that from happening. Using a live-boot Linux USB, I could disable the ZFS import and reboot.
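Roughly the shape of that rescue, as a sketch rather than the exact commands I ran; it assumes a Debian-style live USB, a root filesystem on a regular partition (not ZFS-on-root), and device names that will differ on your machine:
```bash
# From the live USB: mount the box's root filesystem and chroot into it
# (partition is a placeholder; check `lsblk` first)
sudo mount /dev/nvme0n1p2 /mnt
for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt

# Stop systemd from trying to import the dead pool on the next boot
systemctl mask zfs-import-cache.service zfs-import-scan.service

# Back out and reboot into the (now pool-less) system
exit
sudo reboot
```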
The Proxmox UI, however, was a bloodbath.
**0/19 VMs running.**
F@ck.
As it turns out, there's sometimes a reason we do the contingencies we do professionally: high-availability setups, 3-2-1 backup strategies, etc. Even though my services had enjoyed ~99% uptime until then, the single point of failure struck, leaving a lot of damage.
The way I had "designed" my VM installations was by using a separate boot drive and volume drive. This is a feature of KVM / Proxmox that allows sharing a base OS boot disk while separating the actual data. It's quite convenient and keeps VMs slim.
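As a rough illustration of that layout (the VM ID and storage names here are placeholders, not my actual config): each VM gets a small boot disk on one storage and a separate data volume on another.
```bash
# Hypothetical VM 101: small boot disk on one storage, data volume on another
qm set 101 --scsi0 boot-store:20     # base OS / boot disk (20 GB, newly allocated)
qm set 101 --scsi1 data-store:100    # the actual application data (100 GB)
qm set 101 --boot order=scsi0

# Which ends up in /etc/pve/qemu-server/101.conf roughly as:
#   scsi0: boot-store:vm-101-disk-0,size=20G
#   scsi1: data-store:vm-101-disk-1,size=100G
#   boot: order=scsi0
```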
My Debian base image was about 20 GB allocated. That would've been 20 GB × 19 VMs, roughly 380 GB. Not terrible, and honestly, I would've paid that cost if I'd been paying attention.
Instead, I was left with VMs that wouldn't boot because their boot disk was gone. Like a head without a body. [A dog without a bone](https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV).
After a brief panic (actually quite brief), I checked what mattered first: backups. And yes, the important things (code in Gitea, family data) were all backed up and available. I should've tested my contingencies better, but at least monitoring worked.
I restored the most important services on one of my old workstations that I use for development.
I *did* have backups of the VMs... but they were backed up to the same extra drive that had failed.
That was dumb.
However, I had a theory. I could replace the missing boot disks with new ones and reattach them to the existing VM data disks. Basically, give the dog its bone back.
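In Proxmox terms, the theory looks roughly like this sketch; the VM ID, storage names, and volume name are placeholders:
```bash
# Hypothetical VM 101: allocate a fresh boot disk to replace the lost one,
# re-attach the surviving data volume, and boot from the new disk.
qm set 101 --scsi0 boot-store:20              # new, empty boot disk; base OS gets reinstalled onto it
qm set 101 --scsi1 data-store:vm-101-disk-1   # existing data volume that survived the failure
qm set 101 --boot order=scsi0
qm start 101
```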
It was not fun, but I managed to restore Matrix, Home Assistant, this blog, Drone, PostgreSQL, and Gitea. Those were the ones I cared about most and that were actually recoverable. The rest had their data living exclusively on the dead disk.
I may or may not share how I fixed it. It's been a while, and I'd have to reconstruct all the steps. So probably not.
At this point, my Kubernetes cluster was basically *borked* (if you know, you know). All the data was there, but none of the services worked; most of them depended on secrets from Vault, which was gone.
So I had to start over. Pretty much.
It wasn't a huge loss, though. All my data lived in Postgres backups, and all configuration was stored GitOps-style in Gitea.
## Postmortem
I never fully restored all the VMs, and that's fine. I *could* have, but this was also a good opportunity to improve my setup and finally move more things into highly available compute. It was also a chance to replace components I wasn't happy with. Basically the eternal cycle of a homelab.
Harbor was one of them. It's heavy and fragile. Basically, all my Java services had to go. Not because I hate Java, but because they're often far too resource-intensive for a homelab running on mini PCs. I can't have services consuming all the RAM and CPU for very little benefit.
Since then, I've significantly improved my backup setup. I now use proper mirrored RAID setups on my workstations for both workloads and backups, plus an offsite backup. The moving parts, sketched below:
* ZFS with zrepl
* Borgmatic / BorgBackup for offsite
* PostgreSQL incremental backups with pgBackRest
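A hedged, day-to-day view of that stack; the job name, repository paths, and stanza below are placeholders rather than my actual configuration:
```bash
# zrepl: ZFS snapshots + replication between machines
zrepl status                                # live view of snapshot/replication jobs
zrepl signal wakeup workstation_to_backup   # kick a job by name (placeholder name)

# borgmatic / BorgBackup: encrypted, deduplicated offsite backups
borgmatic create --stats                    # run the backup defined in /etc/borgmatic/config.yaml
borgmatic list                              # list archives in the configured repositories

# pgBackRest: full + incremental PostgreSQL backups
pgbackrest --stanza=main backup --type=incr # incremental backup (stanza name is a placeholder)
pgbackrest --stanza=main info               # check what's actually in the backup repo
```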
Everything is monitored. I also replaced five different Grafana services with a single monitoring platform built on OpenTelemetry and SigNoz. It works well, though replacing PromQL with SQL definitely has some growing pains.
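The wiring for that is mostly standard OpenTelemetry: each service points its OTLP exporter at the collector in front of SigNoz. A minimal sketch, assuming a self-hosted SigNoz listening on the usual OTLP ports (the hostname is a placeholder):
```bash
# Standard OpenTelemetry SDK environment variables; an instrumented service
# picks these up and ships its telemetry to the collector in front of SigNoz.
export OTEL_SERVICE_NAME="blog"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://signoz.internal:4317"  # OTLP gRPC (use 4318 for HTTP)
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
./my-service   # placeholder for however the service is actually started
```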
In the next post, I'll probably share how I do compute (Kubernetes from home), and maybe another homelab oops, like the time I nearly lost all my family's Christmas wishes 😉
I swear I'm a professional. But we all make mistakes sometimes. What matters is learning from them and fixing problems even when they seem impossible.
Have a great Friday, and I hope to see you in the next post.