The Homelab

Production practices. Residential square footage.

It deploys from git. It pages me when it breaks. It serves real traffic from my house. Everything else I build is virtual — I wanted to run real cables. So I did.

  • 3 M4 Mac mini k3s nodes
  • 8 apps in production
  • 100% deployed from git (Flux CD)
  • 500/500 Mbps, symmetric, to the street

The homelab rack: a UniFi switch and gateway glowing above three M4 Mac minis seated in blue 3D-printed alignment racks

From the street to a pod

Every request to an app on this page travels the same road: Starry’s fixed-wireless link to the roof, the UniFi gateway, a Traefik ingress, a pod on one of the minis. The part you can’t see is the second network riding on top — a Tailscale mesh that every Mac and VM joins, so cluster traffic moves on encrypted WireGuard paths no matter which physical box a pod lands on.

Diagram: the request path and the overlay network
  • Internet: Starry fixed wireless · 500/500 Mbps
  • UniFi Gateway: DHCP 192.168.1.x · forwards 80/443
  • Ethernet LAN: 80/443 → Traefik
  • jake-mini · M4 · 24 GB: macOS host, headless · Lima VM (Linux, 20 GiB) · k3s server · Traefik · lima1 192.168.1.x · tailscale0 100.x.y.z
  • agent5 · M4 · 16 GB: macOS host, headless · Lima VM (Linux, 12 GiB) · k3s agent · lima1 192.168.1.x · tailscale0 100.x.y.z
  • agentnan · M4 · 16 GB: macOS host, headless · Lima VM (Linux, 12 GiB) · k3s agent · lima1 192.168.1.x · tailscale0 100.x.y.z
  • Tailscale mesh: WireGuard direct paths over the LAN · no DERP relay
  • Legend: physical path · Tailscale overlay (100.x) · inbound request path
One rack, two networks: the gray path is what's physically plugged in; the blue dashes are the WireGuard mesh the cluster actually talks over.
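
The two networks are easy to tell apart by address alone. Here is a minimal sketch that labels an IPv4 address as physical LAN or Tailscale overlay. The /24 mask is an assumption inferred from the 192.168.1.x addresses in the diagram, and `classify` is a hypothetical helper name. Tailscale node addresses come from the 100.64.0.0/10 CGNAT range.

```python
import ipaddress

# Assumption: the LAN is a /24, matching the 192.168.1.x addresses in the diagram.
LAN = ipaddress.ip_network("192.168.1.0/24")
# Tailscale assigns node addresses from the CGNAT range 100.64.0.0/10.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def classify(addr: str) -> str:
    """Return which of the two networks an IPv4 address belongs to."""
    ip = ipaddress.ip_address(addr)
    if ip in LAN:
        return "lan"
    if ip in TAILNET:
        return "tailnet"
    return "other"
```

Calling `classify` on a VM's `lima1` address should return "lan", and on its `tailscale0` address "tailnet".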

Starry fixed wireless

500 Mbps symmetric. The upload matters more than the download when your house is the origin server.

UniFi gateway

Hands out the LAN, forwards exactly the ports the front door needs, and nothing else.

Tailscale mesh

Every Mac and VM is a node. Cluster traffic rides WireGuard, and the VMs hold real LAN IPs so peers connect directly — no DERP relay.

Cloudflare + cert-manager

Public names and TLS renew themselves. I haven’t thought about a certificate in months.

The fleet

Three M4 Mac minis. They were AI agent workstations before they were cluster nodes, and Apple Silicon’s performance-per-watt means the whole rack is silent and barely registers on the power bill.

jake-mini

control plane
  • M4 Mac mini, headless
  • 24 GB RAM · 20 GiB to the k3s server VM
  • macOS keeps ~4 GiB and stays out of the way

agent5

worker
  • M4 Mac mini, headless
  • 16 GB RAM · 12 GiB to the worker VM
  • Named for its first job: running agents

agentnan

worker
  • M4 Mac mini, headless
  • 16 GB RAM · 12 GiB to the worker VM
  • Same story, different name
Front view of the rack cart: the three M4 Mac minis side by side in their 3D-printed alignment racks beneath the UniFi gear

The NOC in the room

An Apple Silicon iMac runs as a dedicated kiosk: always on, always showing Grafana. It’s provisioned by the same Ansible as the fleet — hardening and a node_exporter, nothing else — and it is deliberately not a cluster node. Its only job is to make the cluster’s health ambient.

You notice problems differently when the dashboard is furniture. A pod stuck in CrashLoopBackOff isn’t an email you read tomorrow; it’s a red panel you walk past on the way to the kitchen.

An iMac on a desk glowing with the Grafana cluster dashboard — the same 'all systems go' panels, on all the time

The platform

This is a platform, not a pile of containers. Git is the source of truth: Flux reconciles every manifest in the repo, secrets live encrypted next to the code they configure, and nothing gets kubectl apply’d by hand. If the rack burned down, the cluster would be an Ansible run and a Flux bootstrap away from existing again.
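
Reconciliation is simple at its core: compare what git says should exist with what the cluster has, then close the gap. The toy sketch below is not Flux's implementation. The `plan` function and the name-to-hash maps are hypothetical, chosen only to show the shape of the loop.

```python
def plan(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compare desired state (from git) with actual state (in the cluster).

    Both maps go from object name to a content hash. Returns the actions
    needed to make actual match desired, in a stable order.
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))   # in git, missing from cluster
        elif actual[name] != desired[name]:
            actions.append(("update", name))   # drifted from what git says
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))   # removed from git: prune it
    return actions
```

When the two maps match, the plan is empty. That empty plan is what "reconciled" means: running it again changes nothing.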

Foundation

  • k3s — all of Kubernetes, none of the ceremony
  • Lima + socket_vmnet — Linux VMs on macOS with real LAN IPs
  • Tailscale — the mesh every node and service rides on
  • Ansible — host prep, VM bootstrap, day-2 ops

GitOps & secrets

  • Flux CD — every manifest reconciled from git
  • SOPS + age — secrets encrypted in the repo, decrypted in-cluster
  • Infisical — runtime secrets for the apps themselves
  • Reloader — pods roll when config changes

Storage & data

  • Longhorn — volumes replicated across the minis, so any one Mac can die
  • Garage — the S3 API, but the bytes stay in the house
  • CloudNativePG — Postgres with backups and failover as manifests

Observability

  • Prometheus + Grafana — metrics, dashboards, custom alerts
  • Loki — logs
  • Blackbox exporter — uptime probes on the public endpoints
  • ntfy — alerts push straight to my phone
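
For a sense of what a push to ntfy looks like on the wire, here is a minimal sketch: an HTTP POST to a topic URL, with the message as the body and `Title` and `Priority` headers. The helper names and the example topic URL are hypothetical, and this is not necessarily how the alerts on this cluster are wired.

```python
import urllib.request


def build_alert(topic_url: str, title: str, message: str,
                priority: str = "high") -> urllib.request.Request:
    """Build (but do not send) an ntfy publish request."""
    return urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),                    # body is the notification text
        headers={"Title": title, "Priority": priority},  # ntfy reads these headers
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Usage would look like `send(build_alert("https://ntfy.example.invalid/homelab", "Pod down", "fountain is in CrashLoopBackOff"))`, with the URL swapped for a real topic.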

Edge & identity

  • Traefik — ingress
  • cert-manager + Cloudflare — TLS without thinking about it
  • oauth2-proxy — SSO in front of the private dashboards
  • Tailscale operator — internal services exposed to the tailnet, not the internet

Operability

  • Headlamp — cluster UI for when a terminal is the wrong tool
  • Flux webhooks — push to main, reconcile now, not on the next poll
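
A push-triggered reconcile is only safe if the receiver can tell a real push from a forged one. The sketch below shows the HMAC-SHA256 check that GitHub-style webhooks use (the `X-Hub-Signature-256` header). It assumes the repo lives on GitHub, which this page does not state, and `sign` and `verify` are hypothetical names. Flux's receiver does its own validation internally.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature header value for a webhook body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Check a received signature header against the shared secret."""
    expected = sign(secret, body).encode("utf-8")
    # compare_digest runs in constant time, so timing does not leak the answer.
    return hmac.compare_digest(expected, header.encode("utf-8"))
```

The comparison is constant-time on purpose: a plain `==` could reveal, through timing, how much of a guessed signature was right.
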
Grafana cluster dashboard: an 'all systems go' banner, zero firing alerts, green subsystem checks, and per-node CPU, RAM, and disk for all three Macs and their k3s VMs
Grafana: the NOC dashboard.
Headlamp's cluster map: every namespace laid out as a block — monitoring, loki, tailscale, fountain, grocery-aid, guild, mem0, garage, bambuddy — with zero errors or warnings
Headlamp — the whole cluster as a map, filtered to errors: none.
Terminal output of flux get kustomizations: two dozen kustomizations from cert-manager to fountain, every one of them Ready
flux get kustomizations — everything Ready, nothing suspended.

What it actually runs

The lab isn’t an aquarium. Everything below serves real users — some of them human, some of them AI agents, one of them a 3D printer.

Fountain

The control plane for my fleet of sandboxed coding agents. The agents that build Fountain run on the cluster Fountain is deployed to.

Guild

Work management that connects ticketed work to agent runners — tickets go in, agents pick them up.

Convoy

A real-time strategy game you play by writing code.

Bambuddy

Manages the Bambu Lab 3D printer. Yes, the cluster runs the printer. Yes, the printer printed parts of the cluster’s rack. More below.

Grocery Aid

Shopping and meal planning for our household. The least glamorous app here, and the one with the most demanding stakeholder.

mem0

Open-source AI memory database, self-hosted so the agent fleet’s memory lives on hardware I own.

mcp-echo

A public MCP server that helps MCP creators debug what their clients are actually sending.

The Bambu Lab P1S mid-print: chamber light on, toolhead over the plate, a red print taking shape

The cluster has a hand in the physical world

Bambuddy runs in the cluster and drives the Bambu Lab printer. The printer, in turn, has printed hardware for the cluster: the power holder for the UniFi switch and the alignment racks the Mac minis sit in.

Which means the rack is partially self-hosting in a way software never gets to be — the infrastructure manufactured some of its own mounting hardware.

Decisions & scars

The choices that weren’t obvious, and the one that cost a debugging session.

Why Mac minis?

Because they were already here. These machines started life as workstations for my AI agent fleet, and Apple Silicon turned out to be a genuinely good cluster substrate: an M4 mini idles in single-digit watts, makes no noise, and fits three-wide on a shelf. If I were buying hardware from scratch for a cluster, I’d buy Linux boxes — but I wasn’t, and repurposing beats purchasing.

Why VMs on macOS instead of bare-metal Linux?

Because I couldn’t. Linux doesn’t run on bare-metal M4 Macs; if it did, these would be Linux boxes. So k3s gets the next best thing: one Linux VM per mini. I started on OrbStack and switched to Lima after hitting a networking issue I couldn’t work around. Lima with socket_vmnet gives each VM a bridged interface with a real LAN IP from the UniFi gateway’s DHCP, which turns out to be the detail everything else depends on.

Why three storage systems?

Because each answers a different question. Longhorn replicates block volumes across the minis, so any single Mac can die without taking data with it. Garage gives apps the standard S3 API while the bytes stay on hardware I own. CloudNativePG turns Postgres into manifests — provisioning, backups, failover — instead of a hand-fed database somewhere. The common thread: no workload gets to care which physical box it landed on.
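
"Any single Mac can die" is a claim you can check. A minimal sketch, assuming a hypothetical map from volume name to the set of nodes holding a replica (this is not Longhorn's API): fail each node in turn and confirm every volume still has a copy somewhere else.

```python
def survives_single_node_loss(placement: dict[str, set[str]], nodes: set[str]) -> bool:
    """True if every volume keeps at least one replica after any one node fails."""
    for failed in nodes:
        for replica_nodes in placement.values():
            # Replicas left once the failed node is taken away.
            if not (replica_nodes - {failed}):
                return False
    return True
```

With replicas on at least two different minis per volume, the check passes. A volume pinned to one machine fails it.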

The DERP detour

Tailscale makes everything reachable, which is exactly why it can hide a problem: when peers can’t connect directly, traffic silently falls back to relaying through Tailscale’s DERP servers. With the VMs behind Lima’s userspace NAT, that’s what happened — cluster traffic between two machines sitting a foot apart was round-tripping through a relay on the public internet. Everything worked, just worse than it should have. The fix is the socket_vmnet setup above: give the VMs real LAN addresses and WireGuard forms direct paths across the shelf. The lesson stuck — “it works” and “it works the way you think it does” are different claims, and only the second one survives load.
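
One way to catch this class of problem is to ask Tailscale which peers have no direct path. The sketch below reads the output of `tailscale status --json` once it has been parsed into a dict. The field names (`Peer`, `HostName`, `Online`, `Active`, `CurAddr`) are my understanding of that output and should be checked against your Tailscale version. `relayed_peers` is a hypothetical helper.

```python
def relayed_peers(status: dict) -> list[str]:
    """Return hostnames of online, active peers that have no direct path.

    Assumes the shape of `tailscale status --json`: a "Peer" map whose
    entries carry "CurAddr", which is empty when traffic is not flowing
    over a direct connection.
    """
    relayed = []
    for peer in (status.get("Peer") or {}).values():
        # Idle peers also have an empty CurAddr, so only count active ones.
        if peer.get("Online") and peer.get("Active") and not peer.get("CurAddr"):
            relayed.append(peer.get("HostName", "?"))
    return sorted(relayed)
```

Feed it `json.loads(...)` of the command's output. An empty list means every active peer is on a direct path. Anything else names a machine that is going through a relay.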

This is just how I work.

Nothing on this page was required. I run it this way because git-driven deploys, real monitoring, and encrypted secrets are habits — and habits come with the hire.