
Building a Robust Linux Backup Strategy with Borg

· David Steeman · Linux, Raspberry Pi

We have all heard the mantra: backups are important, test your backups, follow the 3-2-1 rule. But actually implementing a backup strategy that works reliably, alerts you when something goes wrong, and survives reboots is another matter entirely. This is the story of how I built — and continue to evolve — a comprehensive backup system for my homelab server.

Why I finally got serious about backups
#

The catalyst was a failing hard drive. My backup disk, a WD Elements 4.5TB USB HDD, started showing signs of age. It had been quietly handling daily backups for years, but I could see the writing on the wall. This was the wake-up call I needed to not just replace the disk, but to rethink the entire backup strategy.

The old setup was simple but fragile: a single script running rsync to an external disk. No monitoring, no offsite copies, no alerting when things went wrong. If a backup silently failed, I would not know until I needed to restore something.

I replaced the failing WD disk with a Seagate IronWolf 4TB SATA drive. For the offsite side, I repurposed an older drive I had been using for occasional manual offsite backups — it became the seed of an automated offsite backup strategy on a remote Raspberry Pi. More on that later.

Choosing the right tool: Borg Backup
#

After researching modern backup solutions, I settled on Borg Backup. Borg offers several key advantages over traditional tools like rsync or tar:

Deduplication — Borg identifies duplicate chunks of data across all backups and stores them only once. This means that if you have 7 daily backups, you are not using 7 times the space. In practice, my 70GB system backups only consume about 2% more space per additional backup.

Compression — Built-in compression (lz4, zstd, or zlib) reduces storage requirements without significant CPU overhead.

Encryption — Optional client-side encryption, though I opted out since my backup targets are physically secure.

Mountable archives — You can mount any backup archive as a filesystem and browse it like a regular directory, making file-level recovery trivial.

Efficient pruning — Built-in retention policies let you keep daily, weekly, and monthly backups with a single command.

Alternatives I considered were restic (similar feature set, but I found Borg’s documentation clearer) and good old rsync (no deduplication, no built-in pruning, harder to manage multiple point-in-time backups).

Storage architecture
#

The system I built backs up an Ubuntu server called john-ai that runs KVM virtual machines and systemd-nspawn containers. The storage layout looks like this:

Primary disk (NVMe, 932GB):

  • / (92GB) — System root
  • /home (366GB) — User data
  • /var (92GB) — Application data
  • /storage (1.3TB) — Bulk storage (libvirt images, containerd, Docker) — backed up via dedicated repos

Backup targets — the 3-2-1 approach:

  • USB HDD (Seagate IronWolf 4TB, SATA) mounted at /mnt/backup — Nightly backups (primary)
  • NAS (SMB/CIFS share) mounted at /mnt/nas — Passive mirror of USB repos for local redundancy
  • Offsite Raspberry Pi (remote, via Tailscale VPN) — Daily rsync mirror of USB repos for true offsite protection

The USB disk handles the frequent, fast backups. The NAS provides a local second copy for redundancy. The Raspberry Pi provides true offsite protection against fire, theft, or catastrophic failure of the entire building.

The offsite Raspberry Pi
#

The offsite component is a Raspberry Pi 4 Model B at a remote location, connected back to my network via Tailscale mesh VPN. It has an older WD 4.5TB USB HDD — a drive I had previously used for occasional manual offsite backups — now wiped and re-encrypted with LUKS full-disk encryption.

The Pi does not run Borg at all. Instead, a nightly rsync mirrors the USB Borg repositories directly from john-ai to the Pi. This keeps the Pi simple — it just needs rsync and SSH — while giving me byte-identical copies of every Borg archive offsite.

Encrypting the offsite disk — and the key problem
#

An offsite backup is wonderful right up until the disk is stolen, or the building it lives in is searched. So the Pi’s WD disk is fully encrypted with LUKS2. But encryption raises an awkward question: where do you keep the key?

I did not want the key sitting on the Pi’s SD card — then anyone who walks off with the whole Pi has both the lock and the key. Instead, the key lives only on john-ai. When the Pi boots, a systemd service connects back over Tailscale, fetches the key, and pipes it straight into cryptsetup — the key is never written to any filesystem on the Pi. If the Pi is stolen while powered off, the disk is just noise. And if I ever need a kill switch, revoking the Pi’s Tailscale access means it can never reach john-ai to unlock again.
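A minimal sketch of what such a boot-time unlock could look like is shown here. The function name, the device argument, the mapper name, the SSH alias and the key path are all hypothetical; only the idea of streaming the key straight into cryptsetup comes from the setup described above.

```shell
# Sketch: unlock a LUKS2 volume with a key streamed over SSH.
# Every name passed in (device, mapper name, host alias, key path) is hypothetical.
unlock_offsite_disk() {
  device=$1        # block device holding the LUKS2 container
  mapper_name=$2   # name that appears under /dev/mapper once unlocked
  key_host=$3      # SSH alias of the machine that holds the key
  key_path=$4      # path of the keyfile on that machine

  # "--key-file=-" makes cryptsetup read the key from stdin, so the key
  # only exists inside the pipe and is never written to local storage.
  # If ssh fails, cryptsetup receives an empty key and exits non-zero.
  ssh "$key_host" cat "$key_path" |
    cryptsetup open --type luks2 --key-file=- "$device" "$mapper_name"
}
```

Because the return status is that of cryptsetup, a systemd unit wrapping this function would be marked failed whenever the key could not be fetched or did not fit the disk.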

The SSH key that john-ai uses to push backups is itself locked down with rrsync, a wrapper that restricts the key to rsync operations under a single directory. Even if someone extracted that key from john-ai, they could not get a shell on the Pi — only run a backup into one folder.
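The restriction itself lives in the authorized_keys file on the Pi. The helper below is only an illustration of how that line could be assembled: the function name and the rrsync install path are assumptions (the path differs between distributions), while /mnt/backup is the restricted root mentioned later in this post.

```shell
# Sketch: build the authorized_keys line that confines a key to rrsync.
# The rrsync path is an assumption; check where your distribution installs it.
rrsync_authorized_key() {
  pubkey=$1                        # full public key line ("type base64 comment")
  root_dir=${2:-/mnt/backup}       # the only directory the key may write into
  rrsync_bin=${3:-/usr/bin/rrsync}

  # "restrict" turns off PTY allocation and all forwarding options;
  # "command=" forces every login with this key through rrsync.
  printf 'command="%s %s",restrict %s\n' "$rrsync_bin" "$root_dir" "$pubkey"
}

# Usage on the Pi (hypothetical key file name):
#   rrsync_authorized_key "$(cat john-ai-backup.pub)" >> /root/.ssh/authorized_keys
```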

The disaster-recovery paradox
#

There is a subtle trap hiding in this design. The offsite disk can only be unlocked by fetching the key from john-ai — which is exactly the machine the offsite copy exists to survive losing. If my house burns down with john-ai in it, the surviving offsite disk is an encrypted brick, because the only key was in the ashes.

The fix is an offline recovery bundle: the raw keyfile, a backup of the LUKS header (without it, a single bad sector can make the disk unrecoverable even with the key), and a written step-by-step runbook for unlocking and restoring on any Linux machine. That bundle lives in my password manager and on a BitLocker-encrypted USB stick kept physically off-site — never inside the backup itself. I verified the whole chain actually works: the keyfile unlocks the header backup, and the runbook walks through a from-scratch restore with no dependency on john-ai.
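Collecting the bundle can be sketched roughly as follows. The function name, file names and arguments are hypothetical, and the checksum file is an addition for illustration so that a later test can confirm the stored copies are intact. The output directory holds the raw key, so it belongs on encrypted storage only.

```shell
# Sketch: gather the pieces of an offline LUKS recovery bundle.
# Device, keyfile and output directory are hypothetical arguments.
make_recovery_bundle() {
  device=$1; keyfile=$2; outdir=$3

  mkdir -p "$outdir" || return 1
  # Save the LUKS header (key slots and metadata). Without it, a damaged
  # on-disk header makes the data unrecoverable even with the key.
  cryptsetup luksHeaderBackup "$device" \
    --header-backup-file "$outdir/luks-header.img" || return 1
  cp "$keyfile" "$outdir/offsite.key" || return 1
  # Record checksums so the bundle can be verified before it is needed.
  ( cd "$outdir" && sha256sum luks-header.img offsite.key > SHA256SUMS )
}
```

Running `sha256sum -c SHA256SUMS` inside the bundle directory later confirms that neither file has changed since it was written.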

What gets backed up
#

Not everything needs to be backed up. The key insight is to focus on irreplaceable data:

Included:

  • System root (/) — For disaster recovery
  • User data (/home) — Documents, code, configuration
  • Application data (/var) — Databases, service data
  • VM disk images (/storage/libvirt/images/) — Virtual machine storage
  • Container storage (/var/lib/machines/) — systemd-nspawn machines
  • Containerd data (/storage/containerd/) — Container runtimes
  • Docker data (/storage/docker/) — Including Nextcloud database dumps

Excluded:

  • /storage (from system backup — containerd and Docker have their own repos)
  • /var/lib/machines (from system backup — containers have their own repo)
  • /var/log — Transient log files that cause backup errors
  • /proc, /sys, /dev, /run, /tmp — Virtual filesystems
  • /mnt, /media — Mount points for other filesystems

Intentionally not backed up:

  • /storage/ollama/ (128 GB) — AI models, re-downloadable
  • /storage/fileshare/ (178 MB) — Non-critical user files
  • /mnt/backup/frigate/ (201 GB) — NVR footage, transient

Excluding /var/log was a practical decision. Log files are constantly being written to, which causes Borg to report “file changed during backup” warnings. Since logs are transient and not critical for recovery, excluding them eliminated these errors.

The backup repositories
#

Each category of data gets its own Borg repository, which keeps pruning and monitoring independent:

| Repository | Path | Purpose |
| --- | --- | --- |
| System | /mnt/backup/john-ai | Nightly system backup |
| VMs | /mnt/backup/john-ai-vms | Nightly VM disk backups |
| Containers | /mnt/backup/john-ai-containers | Nightly container backups |
| Containerd | /mnt/backup/john-ai-containerd | Nightly containerd data |
| Docker | /mnt/backup/john-ai-docker | Nightly Docker data |

All five repositories are mirrored to the NAS and synced to the offsite Pi.

Retention strategy
#

The 3-2-1 backup rule says you should have 3 copies of data, on 2 different media types, with 1 offsite. My strategy enforces this:

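| Backup Type | Daily | Weekly | Monthly |
| --- | --- | --- | --- |
| All repos (USB) | 3 | 2 | 6 |
| NAS mirror | Mirror of USB | Mirror of USB | Mirror of USB |
| Offsite Pi | Mirror of USB | Mirror of USB | Mirror of USB |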

The USB disk holds the canonical copies with 3 daily, 2 weekly, and 6 monthly snapshots. The NAS and offsite Pi are passive mirrors — they get exactly what is on USB. This means I have 3 complete copies: one on USB, one on the NAS, and one offsite.

Beyond system files: VMs and containers
#

The server runs several virtual machines and containers that need their own backup strategy:

Virtual Machines (KVM):

  • Home Assistant — Nightly backup of the VM disk image
  • Claude Code development VM — Nightly
  • OpenClaw development VM — Nightly

Containers (systemd-nspawn):

  • AdGuard DNS — Nightly backup
  • Minecraft server — Nightly

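VM disk images are backed up while the VMs are running. Borg’s chunk-level deduplication means that only the changed parts of each image consume new space from one night to the next. A copy taken from a live image is at best crash-consistent, so a busy VM can trigger a “file changed” warning like the one shown later in this post. The nightly script also dumps each VM’s XML definition via virsh dumpxml, so restoring a VM is a matter of redefining it and pointing at the disk image.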
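The XML dump step might look something like the sketch below. The function name and output directory are hypothetical; `virsh list --all --name` and `virsh dumpxml` are the standard libvirt commands.

```shell
# Sketch: save every libvirt domain definition as <name>.xml.
# The output directory is a hypothetical argument.
dump_vm_definitions() {
  outdir=$1
  mkdir -p "$outdir" || return 1
  # "virsh list --all --name" prints one domain name per line,
  # followed by a blank line that is skipped below.
  virsh list --all --name | while IFS= read -r vm; do
    [ -n "$vm" ] || continue
    virsh dumpxml "$vm" </dev/null > "$outdir/$vm.xml" ||
      echo "WARN: dumpxml failed for $vm" >&2
  done
}

# Restore side: virsh define /path/to/<vm>.xml
```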

The system also integrates with my Proxmox backup infrastructure. The backup-status script monitors Proxmox VM backups stored on the NAS, giving me a unified view of all backup health across both my KVM host and the Proxmox server.

Monitoring and alerting
#

A backup system that silently fails is worse than no backup system at all. I built several layers of monitoring:

Failure alerts — Every backup script sends an email if it exits with a non-zero code. The email includes the exit code, the full Borg output, and suggested remediation steps.

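Terminated-abnormally trap: The nightly script runs under set -uo pipefail with an EXIT/signal trap. If it exits unexpectedly or receives a catchable signal (a stray kill, a hangup, an error after a USB disconnect), the trap fires and sends an alert instead of dying silently. A SIGKILL from the OOM killer or a power loss cannot be trapped; those cases are covered by the staleness check and the offsite dead-man’s switch described below.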
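A minimal sketch of such a trap follows. `send_alert` is a hypothetical stand-in for the real mail helper, and the flag variable is an assumption about how a normal finish could be told apart from an abnormal one. Note that only catchable signals reach the trap; SIGKILL and power loss do not.

```shell
# Sketch: alert when the script ends without reaching its normal finish.
# send_alert is a hypothetical stand-in for the real mail helper.
send_alert() { echo "ALERT: $*" >&2; }

BACKUP_FINISHED=0

on_exit() {
  status=$?
  # Only alert if the script never set the "finished" flag.
  if [ "$BACKUP_FINISHED" -ne 1 ]; then
    send_alert "TERMINATED ABNORMALLY (exit status $status)"
  fi
}

install_exit_trap() {
  trap on_exit EXIT
  # Turn catchable signals into a normal exit so the EXIT trap runs once.
  # SIGKILL (for example from the OOM killer) cannot be trapped.
  trap 'exit 129' HUP
  trap 'exit 130' INT
  trap 'exit 143' TERM
}

# In the real script, roughly:
#   set -uo pipefail
#   install_exit_trap
#   ... backup steps ...
#   BACKUP_FINISHED=1
```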

Mount checks — Before running, each script verifies that the backup target is mounted. The nightly script also checks that source partitions (/home, /var, /storage) are actually mounted — it refuses to back up an empty mountpoint, which would otherwise cause pruning to delete good archives.

Staleness detection — A script runs every 6 hours to check that backups are fresh. If the nightly backup is older than 30 hours, or the offsite sync is older than 30 hours, an alert is sent. This catches situations where cron jobs silently stop running.
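One way to implement such a check is to compare the modification time of a marker file against the threshold. The sketch below assumes GNU stat and a hypothetical marker path that the backup touches on success; neither detail is confirmed by the setup described here.

```shell
# Sketch: report how many whole hours ago a marker file was last modified.
# Uses GNU stat; the marker path in the usage note is hypothetical.
file_age_hours() {
  now=$(date +%s)
  mtime=$(stat -c %Y "$1") || return 1
  echo $(( (now - mtime) / 3600 ))
}

is_stale() {
  marker=$1; max_hours=$2
  # A missing marker counts as stale: there is no evidence of a good run.
  [ -e "$marker" ] || return 0
  [ "$(file_age_hours "$marker")" -gt "$max_hours" ]
}

# Usage:
#   if is_stale /var/lib/backup/nightly.ok 30; then ...send an alert...; fi
```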

Disk space warnings — The nightly backup checks if /mnt/backup is 90% or more full and emails a non-fatal warning, so I can act before space runs out.
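The usage percentage can be read from df in a portable way. This is a small sketch with a hypothetical function name; the 90% threshold in the usage note is the one stated above.

```shell
# Sketch: print the usage percentage of the filesystem that holds a path.
disk_usage_percent() {
  # "df -P" prints one POSIX-formatted line per filesystem;
  # column 5 is the capacity, for example "91%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage:
#   if [ "$(disk_usage_percent /mnt/backup)" -ge 90 ]; then ...warn...; fi
```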

Weekly summary — Every Monday at 8 AM, I receive a comprehensive email showing the status of all backups, their sizes, and the current schedule. This serves as a regular health check and reminder that the system is working.

Status dashboard — A backup-status command displays the current state of all backups in a formatted terminal dashboard, including archive counts, sizes, freshness, and any issues detected.

Offsite dead-man’s switch — This is the alarm I hope I never need. After every successful offsite sync, john-ai writes a heartbeat file to the Pi — so a fresh heartbeat means the offsite copy is genuinely up to date, not merely that the local backup ran. A watchdog script on the Pi checks it hourly. If the heartbeat goes stale (older than 30 hours), the Pi sends an alert from its own machine and mail path — catching total failures that john-ai itself could never report (powered off, cron dead, email broken). That same watchdog also warns me pre-emptively if the Pi’s own disk passes 95% full, before a sync can fail for lack of space.
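The writer side of the heartbeat could be sketched like this. The host alias and the relative path follow the examples elsewhere in this post; the function name and the temp-file handling are assumptions, and the .heartbeat directory is assumed to exist on the Pi already.

```shell
# Sketch: push a heartbeat file to the Pi after a successful offsite sync.
# The function name and temp-file handling are illustrative only.
write_heartbeat() {
  remote=${1:-offsite-backup}
  tmpfile=$(mktemp) || return 1
  date -u '+%Y-%m-%dT%H:%M:%SZ' > "$tmpfile"
  # The SSH key only permits rsync, so the heartbeat travels as an rsync
  # upload relative to the restricted root. "-t" keeps the fresh mtime,
  # which is what a watchdog on the Pi would compare against.
  rsync -t "$tmpfile" "$remote:.heartbeat/john-ai-last-success"
  rc=$?
  rm -f "$tmpfile"
  return "$rc"
}
```

Calling this only after the rsync of all repositories has returned zero is what ties a fresh heartbeat to a genuinely completed offsite copy.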

Email notifications use msmtp, a lightweight SMTP client that sends mail through my email provider. The configuration stores credentials securely and handles TLS encryption.
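Sending a message through msmtp from a script is short. In this sketch the helper name and the default recipient are hypothetical; msmtp itself reads a complete message from stdin and takes the recipient as an argument.

```shell
# Sketch: send a plain-text alert through msmtp.
# The helper name and the default recipient are hypothetical.
send_alert() {
  subject=$1; body=$2; recipient=${3:-me@example.com}
  # msmtp expects headers, a blank line, then the body on stdin.
  {
    printf 'To: %s\n' "$recipient"
    printf 'Subject: [BACKUP] %s\n' "$subject"
    printf '\n%s\n' "$body"
  } | msmtp "$recipient"
}

# Usage:
#   send_alert "john-ai NIGHTLY BACKUP FAILED" "Exit code: 1"
```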

Integrity and restore testing
#

Monitoring that backups ran is not enough — I need to know they actually work:

Weekly integrity check — Every Saturday at 9 AM, borg check --repository-only runs across all repositories. This is a fast structural check that catches segment and index corruption early.

Monthly deep verification — Once a month, a slower borg check --verify-data reads and cryptographically verifies every data chunk in every repository. The weekly check confirms the repository structure is sound; this one confirms the actual backed-up file data has not silently rotted on disk — the kind of bit-rot a structural check cannot see.

Monthly restore spot-check — On the 1st of each month, an automated script extracts a known file from the latest system archive and compares it to the live version. This proves the archive is actually extractable and the data is intact. If the test fails, I get an alert.
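A spot-check of this kind can be sketched with `borg list` and `borg extract --stdout`. The function name and the choice of test file are assumptions; a file that changes between the backup and the test would produce a false alarm, so a rarely modified file is the better candidate.

```shell
# Sketch: extract one known file from the newest archive and compare it
# with the live copy. Repository path and test file are assumptions.
restore_spot_check() {
  repo=$1   # for example /mnt/backup/john-ai
  file=$2   # absolute path of a file that rarely changes

  # Newest archive name; --short prints archive names only.
  archive=$(borg list --short --last 1 "$repo") || return 1
  [ -n "$archive" ] || return 1

  # Paths inside a Borg archive carry no leading slash.
  borg extract --stdout "$repo::$archive" "${file#/}" | cmp -s - "$file"
}

# Usage:
#   restore_spot_check /mnt/backup/john-ai /etc/hostname || ...send an alert...
```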

These checks surface in the backup-status dashboard alongside everything else, so the Monday summary email always tells me the state of both integrity and restore tests.

How I get alerted
#

All alerts are delivered via email using msmtp. I designed the notification system so that silence means everything is fine — and any email from the backup system demands attention. There are three categories of notifications:

Error alerts
#

When something goes wrong, I get an email immediately with the subject line prefixed [BACKUP] and a clear description. The email includes the exit code, the relevant log output, and suggested remediation steps so I can act without digging through logs.

Here is what a backup failure looks like:

From: john-ai <me@example.com>
Subject: [BACKUP] john-ai NIGHTLY BACKUP FAILED

=== NIGHTLY BACKUP FAILED ===
Exit code: 1
Timestamp: 2026-06-12 00:23:15

--- System backup ---
  Status: OK (10m 32s)

--- VM backup: 100 homeassistant ---
  Status: OK (2m 18s)

--- VM backup: 102 openclaw ---
  Status: FAILED (exit code 1)
  Output:
    borg create: Error: /storage/libvirt/images/102-openclaw: file changed
    while we backed it up

--- Retention + compact ---
  Status: OK

Full log: /var/log/backup-nightly.log

This tells me exactly which VM failed and why — I can see that openclaw had a file-change warning during backup, while everything else succeeded. I know to check that VM’s disk without wading through logs for the other sections.

Other error alerts follow the same pattern but with different subjects:

  • [BACKUP] john-ai NIGHTLY BACKUP ABORTED — A pre-flight check failed (disk not mounted, source partition missing, Borg not installed). The backup did not even start.
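  • [BACKUP] john-ai TERMINATED ABNORMALLY: The script was interrupted mid-run by a catchable signal or an unexpected exit (for example a stray kill or a USB disconnect). This comes from the EXIT trap, not from normal error handling. An untrappable SIGKILL or a power loss surfaces through the staleness and watchdog alerts instead.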
  • [BACKUP] john-ai OFFSITE SYNC FAILED — The rsync to the Pi or NAS mirror failed. The email identifies which specific repository failed.
  • [BACKUP] john-ai INTEGRITY CHECK FAILED — Weekly borg check found repository corruption.
  • [BACKUP] john-ai RESTORE TEST FAILED — Monthly spot-check could not extract from the latest archive.

Warning alerts
#

Warnings are non-fatal issues that I should know about but that do not indicate a failed backup:

From: john-ai <me@example.com>
Subject: [BACKUP] john-ai DISK SPACE WARNING

=== DISK SPACE WARNING ===
/mnt/backup is 91% full (2.9T of 3.2T used)
Backup completed successfully, but disk space is running low.
Consider pruning older archives or replacing the disk.

Staleness alerts arrive when a backup has not run within the expected window:

From: john-ai <me@example.com>
Subject: [BACKUP] john-ai STALENESS ALERT

=== BACKUP STALENESS ALERT ===
Nightly backup age: 34h (threshold: 30h)
Offsite sync age: 8h (OK)

The nightly backup has not completed in over 30 hours.
Check cron jobs and /var/log/backup-nightly.log.

And from the offsite Pi, the dead-man’s switch sends its own alert if john-ai stops checking in entirely:

From: offsite-backup <me@example.com>
Subject: [WATCHDOG] john-ai heartbeat STALE

=== john-ai HEARTBEAT STALE ===
Last heartbeat: 35h ago (threshold: 30h)
Heartbeat file: /mnt/backup/.heartbeat/john-ai-last-success

john-ai has not reported a successful backup in over 30 hours.
This could mean: powered off, cron dead, network down, or email broken.

Notice that this last email comes from the Pi, not from john-ai. That is intentional — if john-ai is completely dead (power off, cron stopped, msmtp broken), it cannot send any alerts. The Pi is an independent watchdog that alerts from its own mail path.

Weekly summary
#

Every Monday at 8 AM, I receive an HTML-formatted status report. This is my regular health check — a quick scan confirms everything is running smoothly without me having to log into the server.

The summary includes the same information as the terminal dashboard: storage status, all repository sizes and archive counts, VM and container freshness, and the self-monitoring section showing the last integrity check and restore test results. If everything is healthy, the subject line says [BACKUP] john-ai Weekly Summary — All healthy. If there are issues, they are highlighted in the subject.

The schedule
#

All backup jobs are defined in a single cron file /etc/cron.d/john-ai-backups:

| Job | Schedule | Description |
| --- | --- | --- |
| Nightly backup | 00:00 daily | System, VMs, containers, containerd, and Docker to USB |
| Offsite + NAS sync | 02:00 daily | rsync USB repos to Pi and NAS |
| Staleness check | Every 6 hours | Verify backups are fresh |
| Integrity check | 09:00 Saturday | borg check --repository-only: structural check across all repos |
| Deep verification | 03:00 on the 15th | borg check --verify-data: verify every data chunk |
| Restore spot-check | 10:00 on the 1st | Extract and verify a file from the latest archive |
| Weekly summary | 08:00 Monday | Email status report |

Running backups at midnight minimizes impact on system performance during active hours. The offsite sync runs at 2 AM, after the nightly backup finishes (it polls for the backup lock to be released before starting). The staleness check runs frequently enough to catch problems quickly, but not so often that it becomes noisy.
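The lock polling could be as simple as the loop below. The lock path, the timeout and the poll interval are hypothetical; the sketch only illustrates waiting with an upper bound so that a stuck backup does not block the sync forever.

```shell
# Sketch: wait until the nightly backup's lock file disappears.
# The lock path, timeout and poll interval are hypothetical.
wait_for_lock_release() {
  lockfile=$1; timeout_s=${2:-14400}; interval_s=${3:-60}
  waited=0
  while [ -e "$lockfile" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "Gave up waiting for $lockfile after ${waited}s" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
}

# Usage:
#   wait_for_lock_release /run/backup-nightly.lock || exit 1
```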

Reboot survival
#

A backup system is only reliable if it survives reboots. Here is what makes the configuration persistent:

| Component | Mechanism |
| --- | --- |
| USB mount | /etc/fstab entry with UUID and nofail option |
| NAS mount | /etc/fstab entry with _netdev option for network dependency |
| Network drive mount | /etc/fstab (read-only SMB/CIFS) |
| Tailscale VPN | Auto-starts via systemd, persists across reboots |
| SSH config for Pi | /root/.ssh/config with offsite-backup host alias |
| Pi LUKS auto-unlock | systemd timer at boot fetches key from john-ai via Tailscale |
| Backup scripts | Installed in /usr/local/bin/ |
| Shared library | /usr/local/lib/backup-lib.sh |
| Cron jobs | /etc/cron.d/john-ai-backups |
| msmtp config | /etc/msmtprc with proper permissions |
| Log rotation | /etc/logrotate.d/backup-logs for all backup log files |

The nofail option for the USB mount ensures the system boots even if the backup disk is disconnected. The _netdev option for the NAS ensures the mount happens after the network is available.

The result
#

The final system provides:

  • Nightly backups of system, VMs, containers, containerd, and Docker to local USB storage
  • Daily offsite sync to a remote Raspberry Pi over Tailscale VPN
  • NAS mirror for local redundancy
  • Automated retention with configurable policies (3 daily / 2 weekly / 6 monthly)
  • LUKS-encrypted offsite disk with the key fetched at boot, plus an offline recovery bundle for true disaster recovery
  • Multi-layered alerting for failures, staleness, disk space, integrity, and weekly status
  • Integrity checks (weekly structural + monthly deep --verify-data) and automated restore testing to prove backups work
  • Offsite dead-man’s switch that alerts even when the main server cannot
  • Unified dashboard for checking backup health at a glance

The peace of mind is substantial. I know that if a disk fails, a file is accidentally deleted, or the entire system is lost, I have multiple recovery options ranging from hours old to months old — and I will always be notified if something goes wrong.

Under the hood: script highlights
#

The core of the system is the unified nightly backup script. Here are the key parts:

Pre-flight checks ensure the backup disk is mounted and source partitions are valid:

# Pre-flight: verify backup disk is mounted
if ! mountpoint -q /mnt/backup; then
  echo "ERROR: /mnt/backup is not mounted!"
  # Attempt mount, alert on failure...
fi

# Verify source partitions are actually mounted (not empty mountpoints)
for src in /home /var /storage; do
  if ! mountpoint -q "$src"; then
    echo "FATAL: $src is not mounted — refusing to back up empty mountpoint"
    # Alert and exit...
  fi
done

System backup uses Borg with compression, per-operation timeout, and exclusion patterns:

timeout "$BORG_TIMEOUT" borg create \
  --compression zstd \
  --exclude-caches \
  --exclude '/proc' \
  --exclude '/sys' \
  --exclude '/dev' \
  --exclude '/run' \
  --exclude '/tmp' \
  --exclude '/mnt' \
  --exclude '/media' \
  --exclude '/storage' \
  --exclude '/var/lib/machines' \
  --exclude '/var/log' \
  --stats \
  "$REPO::nightly-{now:%Y-%m-%dT%H:%M}" \
  / /home /var

Retention pruning with borg compact to actually reclaim disk space:

borg prune "$REPO" \
  --keep-daily=3 \
  --keep-weekly=2 \
  --keep-monthly=6

# Reclaim freed space (required since Borg 1.2)
borg compact "$REPO"

Offsite sync mirrors USB repos to both Pi and NAS:

# Sync to offsite Pi via Tailscale. The SSH key john-ai uses is rrsync-restricted
# on the Pi to /mnt/backup, so the remote paths are relative to that root.
for REPO in john-ai john-ai-vms john-ai-containers john-ai-containerd john-ai-docker; do
  rsync -av --delete --hard-links --numeric-ids \
    "/mnt/backup/$REPO/" "offsite-backup:$REPO/"
done

# Mirror to NAS (CIFS) — size-only because CIFS can't preserve mtimes
rsync -av --delete --size-only \
  /mnt/backup/ /mnt/nas/backups/

The --size-only flag for the NAS mirror deserves explanation. The NAS is mounted via CIFS/SMB, which in this setup does not preserve file modification times. Without --size-only, rsync would see every file as changed and re-transfer the entire ~430 GB every night. Borg writes each segment file once and never modifies it in place, so a segment with the same name and size can be assumed to have the same content, which makes --size-only a reasonable and efficient trade-off here.

The backup-status dashboard
#

Running sudo backup-status gives an at-a-glance view of the entire backup situation:

╔════════════════════════════════════════════════════════════════════════════════╗
║                          JOHN-AI BACKUP STATUS DASHBOARD                       ║
╚════════════════════════════════════════════════════════════════════════════════╝
Generated: 2026-06-12 06:00:15

┌─ STORAGE STATUS ──────────────────────────────────────────────────────────────┐
│ USB Backup Disk (/mnt/backup): MOUNTED
│   Size: 3.6T | Used: 290G (8%) | Available: 3.1T
│ NAS (/mnt/nas):               MOUNTED
│   Size: 1000G | Used: 472G (48%) | Available: 529G
└───────────────────────────────────────────────────────────────────────────────┘

┌─ NIGHTLY BACKUP (USB) ────────────────────────────────────────────────────────┐
│ System (/mnt/backup/john-ai):
│   Archives: 11 | Last backup: 6h
│   Original: 2.88 TB | Deduplicated: 1.98 TB | Compressed: 132 GB
│ VMs (/mnt/backup/john-ai-vms):
│   Archives: 9 | Last backup: 6h
│ Containers (/mnt/backup/john-ai-containers):
│   Archives: 9 | Last backup: 6h
│ Containerd (/mnt/backup/john-ai-containerd):
│   Archives: 8 | Last backup: 6h
│ Docker (/mnt/backup/john-ai-docker):
│   Archives: 8 | Last backup: 6h
└───────────────────────────────────────────────────────────────────────────────┘

┌─ NAS MIRROR ──────────────────────────────────────────────────────────────────┐
│ Status: Synced 4h ago (via backup-offsite-sync.sh)
│ Repos mirrored: john-ai, john-ai-vms, john-ai-containers,
│                 john-ai-containerd, john-ai-docker
└───────────────────────────────────────────────────────────────────────────────┘

┌─ VM BACKUPS ───────────────────────────────────────────────────────────────────┐
│ 100 homeassistant: 6h (healthy)
│ 101 claudecode:    6h (healthy)
│ 102 openclaw:      6h (healthy)
└───────────────────────────────────────────────────────────────────────────────┘

┌─ CONTAINER BACKUPS ────────────────────────────────────────────────────────────┐
│ 101 adguard:    6h (healthy)
│ 102 minecraft:  6h (healthy)
└───────────────────────────────────────────────────────────────────────────────┘

┌─ SELF-MONITORING & INTEGRITY ─────────────────────────────────────────────────┐
│ Last nightly:        SUCCESS (2026-06-12 00:42)
│ Last integrity check: OK (2026-06-07 09:15)
│ Last restore test:   PASS (2026-06-01 10:08)
│ Offsite heartbeat:   Fresh (3h ago)
│ Warnings:            None
└───────────────────────────────────────────────────────────────────────────────┘

┌─ SUMMARY ──────────────────────────────────────────────────────────────────────┐
│ All backups healthy! No issues detected.
└───────────────────────────────────────────────────────────────────────────────┘

The statistics tell an important story: 2.88 TB of original data across all archives takes up just 132 GB on disk after deduplication and compression. That is a space saving of about 95%, meaning I can keep far more backup history than a naive copy would allow.
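The savings figure is easy to re-derive from the dashboard numbers. The helper name below is hypothetical, and 2.88 TB is treated as 2880 GB (decimal units), which is an assumption about how the dashboard rounds.

```shell
# Worked arithmetic for the savings figure, using the dashboard numbers.
# 2.88 TB is treated as 2880 GB (decimal units), which is an assumption.
space_savings_percent() {
  original_gb=$1; stored_gb=$2
  awk -v o="$original_gb" -v s="$stored_gb" \
    'BEGIN { printf "%.1f\n", (1 - s / o) * 100 }'
}

space_savings_percent 2880 132   # prints 95.4
```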

Evolution and lessons learned
#

If I were starting from scratch today, here is what I would do differently:

Start with offsite from day one. My original setup had only local USB backups. It took a failing disk to make me think about offsite copies, and it took another iteration to realize that “offsite” means physically remote — not just a NAS on the same LAN.

Invest in monitoring early. The first version had no alerting. Backups failed silently. The multi-layered monitoring I have now (failure alerts, staleness checks, dead-man’s switch, integrity checks, restore tests) took several iterations to build, but each layer has caught real problems.

Test restores regularly. I used to assume that if borg create succeeded, the backup was good. The monthly restore spot-check has already caught one issue where a changed file path made the test fail — a problem I would not have discovered until I actually needed to restore.

Keep the offsite target simple. The Raspberry Pi runs no Borg, no databases, no complex services. It just receives rsync pushes and runs a simple watchdog cron job. This simplicity means it almost never breaks.

Encrypt the offsite copy — but plan how you will get the key back. Encrypting the remote disk is the easy part. The hard part is realising that if the only key lives on the machine the offsite backup is meant to survive, you have built a brick, not a backup. An offline, off-site copy of the key and the LUKS header closes that loop — and it is worth testing the restore from that copy before you ever need it for real.

Resources
#