---
type: blog-post
title: "Tales of the Homelab I: Moving is fun"
description: We all make mistakes, here is one of mine as I share the tales of my homelab hobby. Revenge of the SSDs
draft: false
date: 2026-01-23
updates:
  - time: 2026-01-23
    description: first iteration
tags:
  - "#blog"
  - "#homelab"
---

I love my homelab. It is an amalgamation of random machines both efficient and not, hosted and not, pretty janky overall. A homelab reflects a lot about what kind of operator you are. It’s a hobby, and we all come from different backgrounds with different interests.

Some like to replace applications when Google kills them, some like to tinker and nerd out about performance, others like to build applications. I like to own my data, kid myself into believing it’s cheaper (it isn’t, electricity and hardware ain’t cheap, y’all), and I like to just build stuff, if that wasn’t apparent from the previous post.

A homelab is a term that isn’t clearly defined. To me, it’s basically the meme:

> Web: here is the cloud
> Hobbyist: cloud at home

It can be anything from a Raspberry Pi, to an old Lenovo ThinkPad, to a full-scale rack with enterprise gear, and often several of those states exist at the same time.

My homelab is definitely in that state: various Raspberry Pis, mini PCs, old workstations, network gear, etc. I basically have two sides to my homelab. One is my media / home-related stuff; the other is my software brain, with PCs running Kubernetes, Docker, this blog, and so on.

It all started with one of my mini PCs. It has a few NVMe drives and runs Proxmox (basically a virtual machine hypervisor datacenter at home). It runs:

* Home Assistant, where it all started (I needed an upgrade from running it on a Raspberry Pi)
* MinIO (S3 server)
* Vault (secrets provider)
* Drone (CI runner)
* Harbor...
* Renova...
* Zitadel...
* Todo...
* Blo...
* Gi...
* P...

In total: **19 VMs**.

You might be saying (and I don’t want to hear it) that this is simply too many. A big, glaring single point of failure. Foreshadowing, right there.

My other nodes run highly available Kubernetes with replicated storage and so on. They do, however, depend on the central node for database and secrets.

## Moving

So, I was moving, and a little bit stressed because I was starting a new job at the same time (the same day, in fact; idiot). I basically packed everything into boxes / the back of my car and moved it.

It took about a week before I got around to setting up my central mini PC again, as I simply began to miss my Jellyfin media center filled with legally procured media, I assure you.

I didn’t think too much of it. Plugged it in on top of a kitchen counter, heard it spin up... and nothing came online. I’ve got monitoring for all my services, and none of it resolved. Curious.

I grabbed a spare screen and plugged it in.

```bash
systemd zfs-import.want: zfs pool unable to mount zfs-clank-pool
```

Hmm. Very much *hmm*. Smells like hardware failure, but no panic yet.

I had an extra SSD in the box, the one used for all the VM volumes. I’d noticed it had been a little loose before, but it hadn’t been a problem. The enclosure is meant for a full-size HDD, not a smaller SSD.

I tried reseating the SSD. No luck.

Slightly panicky now, I found another PC and plugged the SSD into that to check whether it was just the internal connector.

Nope. Nope. Dead SSD. Absolutely dead.

The box wouldn’t boot without the ZFS pool, so I needed a way to stop that from happening. Using a live-boot Linux USB, I could disable the ZFS import and reboot.
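
On ZFS-on-Linux systems the pool import at boot is driven by systemd units, so one way to do this, sketched below from memory, is to mask those units from the live environment. The pool name and mount steps here are illustrative, not my actual setup:

```shell
# From the live USB: import the surviving root pool somewhere temporary
# ("rpool" is a placeholder name), then chroot in and mask the
# zfs-import units so boot stops blocking on the dead pool.
zpool import -f -R /mnt rpool
mount --rbind /dev /mnt/dev && mount --rbind /proc /mnt/proc && mount --rbind /sys /mnt/sys
chroot /mnt systemctl mask zfs-import-cache.service zfs-import-scan.service
# After a successful boot, unmask again and re-import whatever pools are still healthy.
```
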

The Proxmox UI, however, was a bloodbath.

**0/19 VMs running.**

F@ck.

As it turns out, there’s sometimes a reason we do the contingencies we do professionally: high-availability setups, 3-2-1 backup strategies, etc. Even though my services had enjoyed ~99% uptime until then, the single point of failure struck, leaving a lot of damage.

The way I had `designed` my VM installations was by using a separate boot drive and volume drive. This is a feature of KVM / Proxmox and allows sharing a base OS boot disk while separating actual data. It’s quite convenient and keeps VMs slim.

My Debian base image was about 20 GB. That would’ve been 20 GB × 19 VMs, roughly 380 GB. Not terrible, and honestly, I would’ve paid that cost if I’d been paying attention.

Instead, I was left with VMs that wouldn’t boot because their boot disk was gone. Like a head without a body. [A dog without a bone](https://youtu.be/ubWL8VAPoYw?si=iDd3Xk6NCkF1UkRV).

After a brief panic (actually quite brief) I checked what mattered first: backups. And yes, the important things (code in Gitea, family data) were all backed up and available. I should’ve tested my contingencies better, but at least monitoring worked.

I restored the most important services on one of my old workstations that I use for development.

I *did* have backups of the VMs... but they were backed up to the same extra drive that had failed.

That was dumb.

However, I had a theory. I could replace the missing boot disks with new ones and reattach them to the existing VM data disks. Basically, give the dog its bone back.
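
With Proxmox’s `qm` CLI the idea can be sketched like this; the VM ID, storage name, and disk slots are made up for illustration, not my actual layout:

```shell
# Give VM 105 a fresh 20G boot disk on storage "local-zfs" and boot from it;
# the surviving data disk (say, scsi1) stays attached as before.
qm set 105 --scsi0 local-zfs:20
qm set 105 --boot order=scsi0
# Then reinstall the base OS onto the new scsi0; once the guest is up,
# the old data volume on scsi1 can be mounted inside it again.
```
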
It was not fun, but I managed to restore Matrix, Home Assistant, this blog, Drone, PostgreSQL, and Gitea. Those were the ones I cared about most and that were actually recoverable. The rest had their data living exclusively on the dead disk.

I may or may not share how I fixed it. It’s been a while, and I’d have to reconstruct all the steps. So probably not.

At this point, my Kubernetes cluster was basically *borked* (if you know, you know). All the data was there, but none of the services worked; most of them depended on secrets from Vault, which was gone.

So I had to start over. Pretty much.

It wasn’t a huge loss, though. All my data lived in Postgres backups, and all configuration was stored GitOps-style in Gitea.

## Postmortem

I never fully restored all the VMs, and that’s fine. I *could* have, but this was also a good opportunity to improve my setup and finally move more things into highly available compute. It was also a chance to replace components I wasn’t happy with. Basically the eternal cycle of a homelab.
Harbor was one of them. It’s heavy and fragile. Basically, all my Java services had to go. Not because I hate Java, but because they’re often far too resource-intensive for a homelab running on mini PCs. I can’t have services consuming all RAM and CPU for very little benefit.

Since then, I’ve significantly improved my backup setup. I now use proper mirrored RAID setups on my workstations for both workloads and backups, plus an offsite backup.

> Fun fact: as I was building my new backup setup, I had another of these SSDs fail on me. That is 2/3 of my Samsung EVO SSDs; I don’t think I am going to be buying these again.
* ZFS with zrepl
* Borgmatic / BorgBackup for offsite
* PostgreSQL incremental backups with pgBackRest
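
For a sense of scale, the offsite leg with Borgmatic can be as small as a config like this. The paths, repository URL, and retention numbers below are placeholders, not my actual setup, and the flattened layout assumes a recent borgmatic (≥ 1.8):

```yaml
# /etc/borgmatic/config.yaml - illustrative sketch
source_directories:
  - /tank/important
repositories:
  - path: ssh://backup@offsite.example.com/./homelab.borg
    label: offsite
keep_daily: 7
keep_weekly: 4
keep_monthly: 6
```
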

Everything is monitored. I also replaced five different Grafana services with a single monitoring platform built on OpenTelemetry and SigNoz. It works well, though replacing PromQL with SQL definitely has some growing pains.

In the next post, I’ll probably share how I do compute and Kubernetes from home, and maybe another homelab oops, like the time I nearly lost all my family’s Christmas wishes 😉
I swear I’m a professional. But we all make mistakes sometimes. What matters is learning from them and fixing problems even when they seem impossible.
Have a great Friday, and I hope to see you in the next post.