McGarrah Technical Blog

Hybrid Ceph Storage: SSD WAL/DB Acceleration with USB Drive Data

Running Ceph in a homelab means making tradeoffs. You want distributed storage — high availability, scalability, data protection — but enterprise hardware costs spiral fast. My answer: put the brains on SSD and the bulk on cheap USB drives.

After building a 15-OSD Ceph cluster with 69 TiB of raw storage across five Dell OptiPlex 990 nodes, I’ve learned that separating the WAL (Write-Ahead Log) and DB (RocksDB metadata) onto fast SSDs while keeping bulk data on 5TB USB drives delivers excellent performance at a fraction of all-SSD costs.

If you read my recent post about discovering WAL vs DB inconsistencies across the cluster, this is the deeper explanation of what those terms mean, why the separation matters, and how to set it up.

The Theory: Why Separate WAL and DB?

Ceph’s BlueStore backend has three types of data with very different I/O patterns:

By default, all three live on the same device. On a USB HDD, that means every metadata lookup and every write journal entry competes with bulk data I/O on a drive that’s already slow: these portable drives sustain only ~125 MB/s over USB 3.0, far below the bus’s 5 Gbit/s signalling rate, and HDD seek latency comes on top.

Separating WAL and DB onto an SSD means:

WAL vs DB: Which Should You Use?

This is the question I didn’t ask carefully enough when building the cluster, which led to the inconsistency I discovered across nodes.

WAL Only (--block.wal)

DB (--block.db)

The Recommendation

Use DB. It’s what the Proxmox Web UI creates by default, it provides more acceleration for the same SSD space, and there’s no downside. The only reason to use WAL-only is if you’re extremely constrained on SSD capacity and want to dedicate every byte to write journaling.

In my cluster, the nodes created via the Proxmox UI (edgar, poe) got DB, while the nodes created via CLI (harlan, kovacs, quell) got WAL-only. Both work fine, but future rebuilds will standardize on DB.

What Are WAL-Only Nodes Actually Losing?

With 9 of 15 OSDs running WAL-only, the performance gap is worth understanding. The difference comes down to where RocksDB metadata reads happen:

For a cluster with 2.94M objects, that’s a lot of RocksDB lookups. The impact shows up in:

You can compare OSD latency between WAL-only and DB nodes with:

# Compare apply latency across OSDs (lower is better)
for osd in 0 3 6 1 4 7; do
  # Match the OSD id exactly; a prefix grep for "1" would also catch osd.10-14
  LAT=$(ceph osd perf 2>/dev/null | awk -v id="$osd" '$1 == id {print $3}')
  HOST=$(ceph osd metadata "$osd" 2>/dev/null | grep '"hostname"' | awk -F'"' '{print $4}')
  TYPE="WAL"
  ceph osd metadata "$osd" 2>/dev/null | grep -q bluefs_db && TYPE="DB"
  echo "osd.$osd  $HOST  $TYPE  apply_latency: ${LAT}ms"
done

I haven’t run a formal A/B comparison yet — designing a test that isolates the WAL vs DB variable without disrupting a live cluster with 10 TiB of data is tricky. The OSDs serve different PGs with different access patterns, so raw latency numbers aren’t directly comparable. A proper test would require creating matched OSDs on the same node with identical data, which means temporarily destroying and recreating an OSD. That’s a project for a maintenance window, not a Tuesday afternoon. If I get to it, I’ll publish the results as a follow-up.

My Hardware Setup

The AlteredCarbon Cluster

| Node | Role | CPU | RAM | Boot | Ceph SSD | Ceph Data |
|---|---|---|---|---|---|---|
| harlan | OSD host | i7-2600 | 31 GB | 2x 128GB SSD (ZFS mirror) | CT500MX500SSD1 (500GB) | 3x 5TB Seagate USB |
| kovacs | OSD host | i7-2600 | 31 GB | 2x 149GB HDD (ZFS mirror) | CT500MX500SSD1 (500GB) | 3x 5TB Seagate USB |
| poe | OSD host | i7-2600 | 31 GB | 2x 149GB HDD (ZFS mirror) | CT500MX500SSD1 (500GB) | 3x 5TB Seagate USB |
| edgar | OSD host | i7-2600 | 31 GB | 2x 931GB HDD (ZFS mirror) | CT500MX500SSD1 (500GB) | 3x 5TB Seagate USB |
| quell | OSD host | i7-2600 | 31 GB | 2x 128GB SSD (ZFS mirror) | CT500MX500SSD1 (500GB) | 3x 5TB Seagate USB |
| tanaka | Monitor only | i5-2400 | 16 GB | 2x 465GB HDD (ZFS mirror) | - | - |

15 OSDs across 5 hosts. 69 TiB raw, of which 39 TiB is still free; both figures are raw capacity, before 3x replication. Each OSD host has one Crucial MX500 500GB SATA SSD carved into three 100GB LVs for WAL/DB acceleration.

The cluster hasn’t always been this size. Kovacs lost two USB drives during a power failure in September 2025 and ran with a single OSD for months before replacement drives restored it to three. Poe was added as the fifth OSD host during the same period. The hardware table above reflects the current state.

SSD Selection: Crucial MX500

The MX500 was chosen for:

Each 500GB MX500 is partitioned into three 100GB LVM logical volumes, one per OSD. The remaining ~165GB is unallocated — available for wear leveling overhead or a future fourth OSD if I ever find a way to squeeze another USB drive onto a node.
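
The leftover figure comes from a unit conversion: drive vendors quote decimal gigabytes, while `lvcreate -L 100G` allocates binary gibibytes. Here is a minimal sketch of that arithmetic, assuming the whole 500 GB is handed to the volume group (a real VG loses a little to LVM metadata). The function name is illustrative only.

```bash
#!/usr/bin/env bash
# leftover_gib <ssd_size_in_decimal_GB> <lv_count> <lv_size_in_GiB>
# Prints the unallocated space in GiB after carving out the LVs.
leftover_gib() {
  awk -v gb="$1" -v n="$2" -v lv="$3" \
    'BEGIN { printf "%.1f\n", gb * 1e9 / (1024 ^ 3) - n * lv }'
}

leftover_gib 500 3 100   # one MX500, three 100 GiB LVs
leftover_gib 500 4 100   # the same SSD with a hypothetical fourth OSD
```

The first call prints 165.7, matching the ~165GB above. The second prints 65.7, so a fourth 100GB LV would still fit.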

USB Drive Selection: Seagate 5TB Portables

The Ceph data drives are a mix of Seagate models:

All require USB storage quirks for SMART monitoring and stable operation:

echo 'options usb-storage quirks=0bc2:ac2b:,0bc2:ac41:,0bc2:2344:,0bc2:ab9a:' > /etc/modprobe.d/usbstorage-quirks.conf
update-initramfs -u

Sizing WAL and DB Partitions

The Ceph documentation recommends:

For a 5TB OSD:

My cluster uses 100GB per OSD for both WAL-only and DB configurations. This is slightly undersized for the DB recommendation but works well in practice because the actual data stored per OSD (~2-2.5 TiB with 3x replication) generates less metadata than the raw capacity would suggest.
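
To make the "raw capacity versus stored data" point concrete, the ratio can be worked out directly. This is a rough sketch: the function name is made up for illustration, and 4.547 TiB is simply 5 TB converted to binary units.

```bash
#!/usr/bin/env bash
# db_pct <db_size_GiB> <data_size_TiB>
# Prints the DB volume size as a percentage of the data it serves.
db_pct() {
  awk -v db="$1" -v data="$2" \
    'BEGIN { printf "%.2f\n", db / (data * 1024) * 100 }'
}

db_pct 100 4.547   # against the raw 5 TB drive (about 4.547 TiB)
db_pct 100 2.5     # against ~2.5 TiB actually stored
db_pct 100 2       # against ~2 TiB actually stored
```

This prints 2.15, 3.91 and 4.88. Measured against the data actually on each OSD, the 100GB volume is roughly twice as generous as it looks against the raw drive size.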

What Happens If DB Is Too Small?

If the RocksDB metadata outgrows the DB device, Ceph spills the overflow onto the data device (the USB HDD). Performance degrades gracefully — you don’t lose data, you just lose the acceleration for the spilled portion. Monitor with:

ceph daemon osd.X perf dump | grep -i bluefs
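
The grep above dumps every BlueFS counter. If all you want is a yes/no answer on spillover, the `slow_used_bytes` counter is the one to watch: a non-zero value means BlueFS has placed data on the slow device. The helper below is a sketch, not a polished tool. It assumes the pretty-printed JSON layout with one `"key": value` pair per line, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# bluefs_counter <name>: read `perf dump` JSON on stdin and print the first
# counter with that exact name. Assumes one "key": value pair per line.
bluefs_counter() {
  awk -F'[:,]' -v key="\"$1\"" \
    'index($1, key) { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# spill_status: read `perf dump` JSON on stdin and report spillover.
spill_status() {
  local dump spill
  dump=$(cat)
  spill=$(printf '%s\n' "$dump" | bluefs_counter slow_used_bytes)
  if [ "${spill:-0}" -gt 0 ]; then
    echo "spillover: ${spill} bytes of BlueFS data on the slow device"
  else
    echo "no spillover"
  fi
}
```

Usage would look like `ceph daemon osd.6 perf dump | spill_status`, run on the node that hosts the OSD. The result is only meaningful for OSDs that have a dedicated DB volume.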

Creating Hybrid OSDs

Method 1: Proxmox Web UI

Proxmox → Node → Ceph → OSD → Create: OSD

The UI creates the LVM logical volume on the SSD automatically. This creates a DB configuration (recommended).

Method 2: CLI with Pre-Sized LV

For more control, or to force a specific OSD ID:

# Create a 100GB LV on the MX500's VG
lvcreate -L 100G -n osd-db-osd6 ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c

# Create the OSD with DB on the pre-sized LV
ceph-volume lvm create --osd-id 6 --data /dev/sdd \
  --block.db ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c/osd-db-osd6

The --osd-id flag is useful when replacing a failed drive and you want to keep the same OSD number for physical labeling. I used this approach when replacing osd.6 on harlan after a USB drive failure.

Method 3: CLI with WAL Only (Legacy)

lvcreate -L 100G -n osd-wal-osd6 ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c

ceph-volume lvm create --osd-id 6 --data /dev/sdd \
  --block.wal ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c/osd-wal-osd6

This is how the original harlan, kovacs, and quell OSDs were created. It works but provides less acceleration than DB.

Verifying the Configuration

After creation, verify the OSD has the SSD device attached:

ceph osd metadata osd.6 | grep -E '"id"|"hostname"|bluefs_wal|bluefs_db|bluestore_bdev'

For a cluster-wide view:

for osd in $(ceph osd ls | sort -n); do
  HOST=$(ceph osd metadata $osd 2>/dev/null | grep '"hostname"' | awk -F'"' '{print $4}')
  TYPE="WAL"
  ceph osd metadata $osd 2>/dev/null | grep -q bluefs_db && TYPE="DB"
  echo "osd.$osd  $HOST  $TYPE"
done
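
For a quick count of how many OSDs fall into each group, the loop's output can be piped through a small tally. This sketch assumes the three-column `osd.N  HOST  TYPE` format printed above; `tally_types` is a hypothetical name.

```bash
#!/usr/bin/env bash
# tally_types: read "osd.N  HOST  TYPE" lines on stdin and print one
# "TYPE count" line per type, sorted by type name.
tally_types() {
  awk '{ n[$3]++ } END { for (t in n) print t, n[t] }' | sort
}
```

Appending `| tally_types` after the loop's `done` turns the per-OSD listing into a two-line summary.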

Performance Results

What the SSD Acceleration Changes

I don’t have formal before/after benchmarks, because I added the MX500 SSDs when building the OSDs, not as a retrofit. But the improvement is obvious in daily use, and the theory explains why: the SSD eliminates HDD seek latency for the operations that happen most frequently (metadata lookups and write journaling), while the USB drives themselves (~125 MB/s sequential) remain the bottleneck for bulk data.

Observed improvements with SSD WAL/DB:

Where It Doesn’t Help

Current Cluster Performance

root@harlan:~# ceph -s
  data:
    pools:   4 pools, 577 pgs
    objects: 2.94M objects, 10 TiB
    usage:   30 TiB used, 39 TiB / 69 TiB avail

The cluster handles 10 TiB of stored data (30 TiB with 3x replication) across CephFS for media storage and RBD for VM block devices. The hybrid configuration keeps metadata operations responsive even during Ceph recovery and rebalancing events.
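
The `ceph -s` figures are raw capacity, so each one has to be divided by the replica count to get the usable number. A small sketch of that conversion (the function name is illustrative):

```bash
#!/usr/bin/env bash
# usable_tib <raw_TiB> <replica_count>
# Converts a raw capacity figure into usable capacity for a replicated pool.
usable_tib() {
  awk -v raw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", raw / size }'
}

usable_tib 69 3   # total usable capacity
usable_tib 30 3   # data currently stored
usable_tib 39 3   # room left for new data
```

This prints 23.0, 10.0 and 13.0. In practice the last number is an upper bound, since Ceph stops accepting writes before the OSDs are completely full.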

Cost Analysis

Per-OSD Cost Comparison

| Configuration | Drive Cost | SSD Cost | Total per OSD |
|---|---|---|---|
| Hybrid (USB + SSD WAL/DB) | $100-129 (5TB USB) | ~$17-20 (1/3 of MX500) | $117-149 |
| All-SSD | $400-600 (5TB enterprise SSD) | included | $400-600 |
| USB only (no SSD) | $100-129 (5TB USB) | $0 | $100-129 |

Full Cluster Cost

For the 15-OSD AlteredCarbon cluster:

| Configuration | Total Cost | Performance |
|---|---|---|
| Hybrid (actual) | ~$1,800-2,200 | Excellent for homelab |
| All-SSD equivalent | ~$6,000-9,000 | Overkill for USB bandwidth |
| USB only | ~$1,500-1,900 | Sluggish metadata |

The hybrid approach adds ~$300 (5x MX500 SSDs) to the USB-only cost and delivers the majority of the SSD performance benefit. The all-SSD option would be wasted on USB 3.0 bandwidth: you’d pay roughly 3-4x more and still be bottlenecked by the bus.
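
The cluster totals are just the per-drive prices multiplied out. As a sketch, assuming $60 per MX500 (derived from the ~$300 for five SSDs) and a hypothetical function name:

```bash
#!/usr/bin/env bash
# cluster_cost <osd_count> <usb_drive_price> <ssd_count> <ssd_price>
# Prints the total drive cost in whole dollars.
cluster_cost() {
  echo $(( $1 * $2 + $3 * $4 ))
}

cluster_cost 15 100 5 60   # hybrid, low end of the USB price range
cluster_cost 15 129 5 60   # hybrid, high end
cluster_cost 15 100 0 0    # USB only, low end
```

This prints 1800, 2235 and 1500, in line with the rounded ranges in the table.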

When the Math Changes

The hybrid approach stops making sense when:

Operational Considerations

SSD Wear Monitoring

The MX500 SSDs handle constant WAL/DB writes. Monitor wear with:

smartctl -A /dev/sdc | grep -E "Wear_Leveling|Media_Wearout|Available_Reservd|Percent_Lifetime"

At 100GB per OSD with 3 OSDs per SSD, the write amplification is modest. My MX500s show minimal wear after months of operation. Budget for replacement every 3-5 years depending on write volume.

USB Drive Health

USB Ceph drives fail in annoying ways — hung USB bridges that block every command, drives that disappear from the bus entirely, and enclosures that enumerate for half a second then disconnect.

Essential monitoring:

smartctl -H /dev/sdd -d sat,12

See Enabling SMART Monitoring on Seagate USB Drives and USB Drive SMART Updates for the full setup including USB quirks and GRUB configuration.
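
Because a hung USB bridge can block `smartctl` indefinitely, it helps to wrap the health check in a timeout when scripting it. The following is a sketch under assumptions: the 20-second limit is arbitrary, both function names are made up, and the parser only looks for the overall-health line that `smartctl -H` prints for ATA devices.

```bash
#!/usr/bin/env bash
# smart_verdict: classify `smartctl -H` output read from stdin.
smart_verdict() {
  local out
  out=$(cat)
  case "$out" in
    *"test result: PASSED"*) echo "PASSED" ;;
    *"test result: FAILED"*) echo "FAILED" ;;
    *) echo "UNKNOWN" ;;   # no verdict: timeout, missing device, or odd output
  esac
}

# check_drive <device>: give up after 20 s so a hung bridge cannot stall a loop.
check_drive() {
  timeout 20 smartctl -H -d sat,12 "$1" 2>/dev/null | smart_verdict
}
```

Calling `check_drive /dev/sdd` returns one word per drive. An UNKNOWN from a drive that normally reports PASSED is itself a hint that the bridge has stopped responding.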

Label your drives. When a drive fails, knowing which physical USB cable corresponds to which OSD saves significant debugging time.

The Single SSD Risk

The biggest risk of this architecture: one SSD failure takes out three OSDs simultaneously. Every OSD on that node loses its WAL/DB device and goes down at once. On a node with three 5TB OSDs, that’s ~13.6 TiB of raw capacity disappearing in an instant.

I mitigate this with cluster design. The AlteredCarbon cluster has 5 OSD hosts with 3 OSDs each. Ceph’s CRUSH map distributes replicas across hosts, so losing an entire node (all 3 OSDs) still leaves 2 copies of every object on the remaining 4 hosts. The cluster continues serving data in a degraded state while you replace the SSD.

This is the same failure domain as losing the node itself — a power supply failure, motherboard death, or even accidentally unplugging the wrong power cable would have the same effect. The SSD doesn’t make the failure worse, it just adds another component that can trigger it.

If you only have 3 OSD hosts, this risk is more serious. Losing one node’s SSD means losing one-third of your OSDs, and with min_size=2 you’d be one more failure away from data unavailability. Four or more OSD hosts is the minimum I’d recommend for this architecture.
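
One way to sanity-check the host count is to ask how full the surviving hosts would be if Ceph re-replicated everything onto them. The sketch below assumes equally sized hosts and a host-level CRUSH failure domain; the function name is illustrative.

```bash
#!/usr/bin/env bash
# fill_after_host_loss <used_raw_TiB> <total_raw_TiB> <host_count>
# Prints how full (in %) the surviving hosts would be once all data has
# been re-replicated onto them. Assumes equally sized hosts.
fill_after_host_loss() {
  awk -v used="$1" -v total="$2" -v hosts="$3" \
    'BEGIN { printf "%.1f\n", used / (total * (hosts - 1) / hosts) * 100 }'
}

fill_after_host_loss 30 69 5   # current cluster losing one of five hosts
```

This prints 54.3, so the four remaining hosts have room to restore full 3x redundancy on their own. With only three hosts and 3x replication across hosts, there is no spare host to receive the third copy, so the affected placement groups stay degraded until the failed node returns.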

SSD Failure Recovery

If the WAL/DB SSD fails, the affected OSDs will go down. Recovery depends on the failure mode:

SSD dies, data drives intact:

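# A BlueStore OSD cannot run without its WAL/DB device: the RocksDB metadata
# (or journal) that lived on the SSD is gone, so the objects still on the
# USB drive are not usable on their own. Treat all three OSDs as lost.
systemctl stop ceph-osd@0 ceph-osd@3 ceph-osd@6
# Replace the SSD, recreate the VG and the three LVs
# Destroy and recreate each OSD on its existing USB drive with a new WAL/DB LV
# Ceph then backfills the recreated OSDs from the replicas on the other hosts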
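
The "recreate the OSDs" step can be sketched as a dry run that only prints the commands for one OSD, so they can be reviewed before anything destructive happens. This is an outline rather than a tested procedure: `plan_osd_rebuild` is a hypothetical helper, `NEW_VG` stands in for whatever volume group you create on the replacement SSD, and `/dev/sdd` is simply the osd.6 device used earlier in this post.

```bash
#!/usr/bin/env bash
# plan_osd_rebuild <osd_id> <data_device> <vg>/<db_lv>
# Prints (does not run) the commands to rebuild one OSD on its existing
# data drive with a fresh DB volume.
plan_osd_rebuild() {
  local id="$1" dev="$2" db_lv="$3"
  echo "ceph osd out ${id}"
  echo "systemctl stop ceph-osd@${id}"
  echo "ceph osd destroy ${id} --yes-i-really-mean-it"
  echo "ceph-volume lvm zap ${dev} --destroy"
  echo "ceph-volume lvm create --osd-id ${id} --data ${dev} --block.db ${db_lv}"
}

plan_osd_rebuild 6 /dev/sdd NEW_VG/osd-db-osd6
```

Printing instead of executing is deliberate: `zap --destroy` wipes the data drive, so the device name deserves a second look before each command is run by hand.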

USB drive dies, SSD intact:

The WAL/DB on the SSD is useless without the data drive. Remove the dead OSD, clean up the orphaned LV on the SSD, and add a replacement drive. This is exactly what I did when osd.6’s Seagate BUP Portable died.

Rebalancing with Hybrid OSDs

During Ceph rebalancing, the SSD acceleration helps significantly. Metadata operations that drive the rebalancing process (peering, PG migration decisions) run at SSD speed, even though the actual data movement is limited by USB bandwidth. This means rebalancing starts faster and tracks progress more efficiently, even if the bulk data transfer rate is the same.

Lessons Learned

  1. Use DB, not WAL-only. The Proxmox UI defaults to DB for good reason. It accelerates both writes and metadata reads for the same SSD space.

  2. 100GB per OSD is a good starting point. For 5TB data drives with typical homelab workloads (media, backups, VMs), 100GB DB partitions haven’t spilled to the HDD.

  3. One SSD per node, partitioned for all OSDs. A single 500GB MX500 handles three OSDs comfortably. No need for one SSD per OSD.

  4. The SSD doesn’t fix the USB bottleneck. Large sequential I/O is still limited by USB 3.0. The SSD fixes the metadata bottleneck, which is what makes the cluster feel responsive.

  5. Label everything. Physical drive labels with OSD numbers save hours during failures. I learned this the hard way during the April 2026 osd.6 replacement.

  6. Monitor SSD wear and USB health separately. They fail in different ways — SSDs wear out gradually (SMART attributes), USB drives fail suddenly (bus disconnects, bridge hangs).

  7. Document your creation method. Whether you used the Proxmox UI (DB) or CLI (WAL), knowing which method created each OSD explains configuration differences when you’re debugging at 2 AM.

Categories: proxmox, ceph, homelab, storage

About the Author: Michael McGarrah is a Cloud Architect with 25+ years in enterprise infrastructure, machine learning, and system administration. He holds an M.S. in Computer Science (AI/ML) from Georgia Tech and a B.S. in Computer Science from NC State University, and is currently pursuing an Executive MBA at UNC Wilmington.