Hybrid Ceph Storage: SSD WAL/DB Acceleration with USB Drive Data

By Michael McGarrah · April 16, 2026 · 16 min read

Running Ceph in a homelab means making tradeoffs — the same cost-performance-resilience triangle that drives every storage architecture decision. You want distributed storage — high availability, scalability, data protection — but enterprise hardware costs spiral fast. My answer: put the brains on SSD and the bulk on cheap USB drives.

After building a 15-OSD Ceph cluster with 69 TiB of raw storage across five Dell OptiPlex 990 nodes, I’ve learned that separating the WAL (Write-Ahead Log) and DB (RocksDB metadata) onto fast SSDs while keeping bulk data on 5TB USB drives delivers excellent performance at a fraction of all-SSD costs.

If you read my recent post about discovering WAL vs DB inconsistencies across the cluster, this is the deeper explanation of what those terms mean, why the separation matters, and how to set it up.

The Theory: Why Separate WAL and DB?

Ceph’s BlueStore backend has three types of data with very different I/O patterns:

Data — the actual objects you store. Large sequential writes and reads. Tolerates slow storage.
WAL (Write-Ahead Log) — a journal of pending writes. Small, constant, latency-sensitive. Every write hits the WAL first.
DB (RocksDB metadata) — the index that maps object names to locations. Small random reads on every operation. Heavily accessed during scrubs, recovery, and listing operations.

By default, all three live on the same device. On a USB HDD, that means every metadata lookup and every write journal entry competes with bulk data I/O on a drive that’s already slow (these portable drives sustain only ~125 MB/s over USB 3.0, and HDDs add seek latency on top).

Separating WAL and DB onto an SSD means:

Writes complete faster: small writes are committed to the WAL at SSD speed and flushed to the HDD in the background (large writes still go straight to the data device)
Metadata lookups are fast: directory listings, object lookups, and scrub operations hit the SSD instead of waiting for HDD seeks
Recovery is faster: when an OSD comes back after a failure, the metadata operations that drive peering and backfill run at SSD speed

WAL vs DB: Which Should You Use?

This is the question I didn’t ask carefully enough when building the cluster, which led to the inconsistency I discovered across nodes.

WAL Only (`--block.wal`)

SSD stores only the Write-Ahead Log
RocksDB metadata lives on the HDD with the data
Accelerates writes but metadata reads still hit the HDD
Created via CLI: ceph-volume lvm create --data /dev/sdX --block.wal /dev/ssd-partition

DB (`--block.db`)

SSD stores both the RocksDB metadata and the WAL
Accelerates writes (WAL) and metadata lookups (DB)
Strictly better than WAL-only if you have the SSD space
Created via Proxmox UI (default) or CLI: ceph-volume lvm create --data /dev/sdX --block.db /dev/ssd-partition

The Recommendation

Use DB. It’s what the Proxmox Web UI creates by default, it provides more acceleration for the same SSD space, and there’s no downside. The only reason to use WAL-only is if you’re extremely constrained on SSD capacity and want to dedicate every byte to write journaling.

In my cluster, the nodes created via the Proxmox UI (edgar, poe) got DB, while the nodes created via CLI (harlan, kovacs, quell) got WAL-only. Both work fine, but future rebuilds will standardize on DB.

What Are WAL-Only Nodes Actually Losing?

With 9 of 15 OSDs running WAL-only, the performance gap is worth understanding. The difference comes down to where RocksDB metadata reads happen:

DB nodes (edgar, poe): metadata reads hit the MX500 SSD — sub-millisecond latency
WAL-only nodes (harlan, kovacs, quell): metadata reads hit the USB HDD — 5-15ms seek latency per operation

For a cluster with 2.94M objects, that’s a lot of RocksDB lookups. The impact shows up in:

Scrub operations — each object’s checksum requires a metadata lookup. WAL-only OSDs scrub slower because every lookup waits for an HDD seek.
Recovery and backfill — peering decisions require reading metadata for every PG. DB nodes peer faster after a restart.
CephFS directory listings — listing a directory with thousands of files triggers metadata reads for each entry. DB nodes respond noticeably faster.
OSD startup time — the OSD replays its RocksDB on startup. DB nodes boot faster because the replay reads from SSD.

You can compare OSD latency between WAL-only and DB nodes with:

# Compare apply latency across OSDs (lower is better)
for osd in 0 3 6 1 4 7; do
  # Match the OSD id column exactly so "1" does not also match 10-14
  LAT=$(ceph osd perf 2>/dev/null | awk -v id="$osd" '$1 == id {print $3}')
  HOST=$(ceph osd metadata "$osd" 2>/dev/null | grep '"hostname"' | awk -F'"' '{print $4}')
  TYPE="WAL"
  ceph osd metadata "$osd" 2>/dev/null | grep -q bluefs_db && TYPE="DB"
  echo "osd.$osd  $HOST  $TYPE  apply_latency: ${LAT}ms"
done

I haven’t run a formal A/B comparison yet — designing a test that isolates the WAL vs DB variable without disrupting a live cluster with 10 TiB of data is tricky. The OSDs serve different PGs with different access patterns, so raw latency numbers aren’t directly comparable. A proper test would require creating matched OSDs on the same node with identical data, which means temporarily destroying and recreating an OSD. That’s a project for a maintenance window, not a Tuesday afternoon. If I get to it, I’ll publish the results as a follow-up.

My Hardware Setup

The AlteredCarbon Cluster

Node	Role	CPU	RAM	Boot	Ceph SSD	Ceph Data
harlan	OSD host	i7-2600	31 GB	2x 128GB SSD (ZFS mirror)	CT500MX500SSD1 (500GB)	3x 5TB Seagate USB
kovacs	OSD host	i7-2600	31 GB	2x 149GB HDD (ZFS mirror)	CT500MX500SSD1 (500GB)	3x 5TB Seagate USB
poe	OSD host	i7-2600	31 GB	2x 149GB HDD (ZFS mirror)	CT500MX500SSD1 (500GB)	3x 5TB Seagate USB
edgar	OSD host	i7-2600	31 GB	2x 931GB HDD (ZFS mirror)	CT500MX500SSD1 (500GB)	3x 5TB Seagate USB
quell	OSD host	i7-2600	31 GB	2x 128GB SSD (ZFS mirror)	CT500MX500SSD1 (500GB)	3x 5TB Seagate USB
tanaka	Monitor only	i5-2400	16 GB	2x 465GB HDD (ZFS mirror)	—	—

15 OSDs across 5 hosts. 69 TiB raw, of which 39 TiB is still free; with 3x replication that is room for roughly 13 TiB of new data. Each OSD host has one Crucial MX500 500GB SATA SSD split into three 100GB LVs for WAL/DB acceleration.

The cluster hasn’t always been this size. Kovacs lost two USB drives during a power failure in September 2025 and ran with a single OSD for months before replacement drives restored it to three. Poe was added as the fifth OSD host during the same period. The hardware table above reflects the current state.

SSD Selection: Crucial MX500

The MX500 was chosen for:

Endurance — 180 TBW for the 500GB model, adequate for WAL/DB workloads
Price — ~$50-60, shared across 3 OSDs = ~$17-20 per OSD
SATA interface — fits the Dell OptiPlex 990’s available SATA ports (one remaining after boot mirror and DVD removal)
Reliability — well-established consumer SSD with consistent performance

Each 500GB MX500 is partitioned into three 100GB LVM logical volumes, one per OSD. The remaining ~165GB is unallocated — available for wear leveling overhead or a future fourth OSD if I ever find a way to squeeze another USB drive onto a node.

USB Drive Selection: Seagate 5TB Portables

The Ceph data drives are a mix of Seagate models:

Seagate One Touch HDD (0bc2:ac41) — 5TB, USB 3.0
Seagate BUP Portable (0bc2:ab9a) — 5TB, USB 3.0
Seagate Expansion (0bc2:2344) — 5TB, USB 3.0

All require USB storage quirks for SMART monitoring and stable operation:

echo 'options usb-storage quirks=0bc2:ac2b:,0bc2:ac41:,0bc2:2344:,0bc2:ab9a:' > /etc/modprobe.d/usbstorage-quirks.conf
update-initramfs -u
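
When a new drive model joins the cluster, the quirks line has to be regenerated with its USB ID. A minimal sketch, assuming you pass the vendor:product IDs (as reported by lsusb) as arguments. The function name quirks_line is a placeholder of mine, and it keeps the same empty-flags form used in the line above:

```shell
# quirks_line: print a modprobe options line for the given USB IDs.
# Each ID is emitted as VID:PID: (empty flags field, as in the line above).
quirks_line() {
  list=""
  for id in "$@"; do
    # Prefix a comma only when the list already has an entry
    list="${list:+$list,}${id}:"
  done
  printf 'options usb-storage quirks=%s\n' "$list"
}

# Example (review the output before writing it to /etc/modprobe.d/):
# quirks_line 0bc2:ac41 0bc2:2344 0bc2:ab9a
```

After a reboot, cat /sys/module/usb_storage/parameters/quirks should show the list the kernel actually loaded.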

Sizing WAL and DB Partitions

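The Ceph documentation's percentage guidance applies to block.db; a standalone WAL needs far less:

WAL: small, because it only holds writes that have not yet been flushed. A few GB is typically plenty (minimum useful size is ~1GB)
DB: 1-2% of OSD data size for RBD-style workloads, 4% for metadata-heavy workloads and optimal RocksDB performance

For a 5TB OSD:

WAL only: a few GB is actually used; a larger LV works but sits mostly idle
DB (includes WAL): ~200GB ideal, 100GB workable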

My cluster uses 100GB per OSD for both WAL-only and DB configurations. This is slightly undersized for the DB recommendation but works well in practice because the actual data stored per OSD (~2-2.5 TiB with 3x replication) generates less metadata than the raw capacity would suggest.
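
The percentage rules in this section are easy to turn into concrete sizes. A small sketch of the arithmetic, treating the 5TB drive as 5000 GB (pct_of_gb is a helper name I made up for illustration):

```shell
# pct_of_gb SIZE_GB PERCENT: integer percentage of a device size, in GB.
pct_of_gb() {
  echo $(( $1 * $2 / 100 ))
}

# Print the 1%, 2%, and 4% sizes for a 5000 GB data drive
for pct in 1 2 4; do
  echo "${pct}% of 5000 GB = $(pct_of_gb 5000 "$pct") GB"
done
```

This prints 50, 100, and 200 GB, which is where the 100GB-per-OSD figure sits relative to the 4% target.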

What Happens If DB Is Too Small?

If the RocksDB metadata outgrows the DB device, Ceph spills the overflow onto the data device (the USB HDD). Performance degrades gracefully — you don’t lose data, you just lose the acceleration for the spilled portion. Monitor with:

ceph daemon osd.X perf dump | grep -i bluefs
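
The grep above returns a lot of counters. A rough sketch that pulls out just the two that matter for spillover, assuming the perf dump is pretty-printed JSON and that the bluefs section exposes db_used_bytes and slow_used_bytes (verify the counter names on your Ceph release; the function names are placeholders). It is only meaningful for DB OSDs, since WAL-only OSDs keep RocksDB on the slow device by design:

```shell
# bluefs_counter NAME: read `ceph daemon osd.X perf dump` JSON on stdin and
# print the first numeric value found for the counter NAME.
bluefs_counter() {
  sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# spill_status: print SPILLOVER if any BlueFS bytes sit on the slow device.
spill_status() {
  dump=$(cat)
  slow=$(printf '%s\n' "$dump" | bluefs_counter slow_used_bytes)
  db=$(printf '%s\n' "$dump" | bluefs_counter db_used_bytes)
  if [ "${slow:-0}" -gt 0 ]; then
    echo "SPILLOVER slow_used_bytes=$slow db_used_bytes=${db:-0}"
  else
    echo "OK db_used_bytes=${db:-0}"
  fi
}

# Example: ceph daemon osd.6 perf dump | spill_status
```

Ceph also raises a BLUEFS_SPILLOVER health warning, so ceph health detail is a quicker first check.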

Creating Hybrid OSDs

Method 1: Proxmox Web UI (Recommended)

Proxmox → Node → Ceph → OSD → Create: OSD

Disk: select the USB drive (e.g., /dev/sdd)
DB Disk: select the MX500 (e.g., /dev/sdc)
DB Disk Size: 100 GiB
WAL Disk: leave empty (DB includes WAL)

The UI creates the LVM logical volume on the SSD automatically. This creates a DB configuration (recommended).

Method 2: CLI with Pre-Sized LV

For more control, or to force a specific OSD ID:

# Create a 100GB LV on the MX500's VG
lvcreate -L 100G -n osd-db-osd6 ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c

# Create the OSD with DB on the pre-sized LV
ceph-volume lvm create --osd-id 6 --data /dev/sdd \
  --block.db ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c/osd-db-osd6

The --osd-id flag is useful when replacing a failed drive and you want to keep the same OSD number for physical labeling. I used this approach when replacing osd.6 on harlan after a USB drive failure.

Method 3: CLI with WAL Only (Legacy)

lvcreate -L 100G -n osd-wal-osd6 ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c

ceph-volume lvm create --osd-id 6 --data /dev/sdd \
  --block.wal ceph-8c2b41c2-65d6-4f39-ae13-d6f5d208878c/osd-wal-osd6

This is how the original harlan, kovacs, and quell OSDs were created. It works but provides less acceleration than DB.

Verifying the Configuration

After creation, verify the OSD has the SSD device attached:

ceph osd metadata osd.6 | grep -E '"id"|"hostname"|bluefs_wal|bluefs_db|bluestore_bdev'

For a cluster-wide view:

for osd in $(ceph osd ls | sort -n); do
  HOST=$(ceph osd metadata $osd 2>/dev/null | grep '"hostname"' | awk -F'"' '{print $4}')
  TYPE="WAL"
  ceph osd metadata $osd 2>/dev/null | grep -q bluefs_db && TYPE="DB"
  echo "osd.$osd  $HOST  $TYPE"
done
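
The loop above labels every OSD without a DB device as "WAL", which would also include an OSD with no SSD at all. A sketch that separates the three cases, assuming the metadata JSON carries bluefs_dedicated_db and bluefs_dedicated_wal flags with values "0" or "1" (check the output of ceph osd metadata on your release; classify_osd is a placeholder name):

```shell
# classify_osd: read `ceph osd metadata <id>` JSON on stdin and print
# DB, WAL, or NONE depending on which dedicated device is attached.
classify_osd() {
  meta=$(cat)
  if printf '%s\n' "$meta" | grep -Eq '"bluefs_dedicated_db": *"1"'; then
    echo "DB"
  elif printf '%s\n' "$meta" | grep -Eq '"bluefs_dedicated_wal": *"1"'; then
    echo "WAL"
  else
    echo "NONE"
  fi
}

# Example:
# for osd in $(ceph osd ls | sort -n); do
#   echo "osd.$osd  $(ceph osd metadata "$osd" 2>/dev/null | classify_osd)"
# done
```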

Performance Results

What the SSD Acceleration Changes

I don’t have formal before/after benchmarks because I added the MX500 SSDs when building the OSDs, not as a retrofit. But the improvement is obvious in daily use, and the theory explains why: the SSD eliminates HDD seek latency for the operations that happen most frequently (metadata lookups and write journaling), while the USB-attached HDDs (~125 MB/s at best) remain the bottleneck for bulk data.

Observed improvements with SSD WAL/DB:

Small write latency: Writes acknowledge at SSD speed before flushing to HDD in the background
Directory listings on CephFS: Responsive even for directories with thousands of files (the cluster holds ~2.94M objects)
OSD peering after restart: Minutes instead of tens of minutes — I saw this directly during the April 2026 osd.6 replacement when the new OSD peered quickly despite backfilling 4.5TB of data
Scrub operations: Faster completion with less impact on client I/O

Where It Doesn’t Help

Large sequential reads (streaming media from Jellyfin) — bottlenecked by USB 3.0, not metadata
Large sequential writes (bulk file copies to CephFS) — same USB bottleneck
Network-limited operations — 1GbE is often the bottleneck before USB is

Current Cluster Performance

root@harlan:~# ceph -s
  data:
    pools:   4 pools, 577 pgs
    objects: 2.94M objects, 10 TiB
    usage:   30 TiB used, 39 TiB / 69 TiB avail

The cluster handles 10 TiB of stored data (30 TiB with 3x replication) across CephFS for media storage and RBD for VM block devices. The hybrid configuration keeps metadata operations responsive even during Ceph recovery and rebalancing events.
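
The raw figures in the ceph -s output convert to usable space by dividing by the replica count. A quick sketch of that arithmetic (usable_tib is a helper name chosen for illustration, and it ignores per-OSD imbalance):

```shell
# usable_tib RAW_TIB REPLICAS: raw capacity divided by the replica count.
usable_tib() {
  awk -v raw="$1" -v rep="$2" 'BEGIN { printf "%.1f\n", raw / rep }'
}

echo "stored: $(usable_tib 30 3) TiB of data behind 30 TiB raw used"
echo "free:   $(usable_tib 39 3) TiB of new data fits in 39 TiB raw avail"
```

In practice the cluster stops accepting writes earlier than that, because Ceph's nearfull and full ratios (0.85 and 0.95 by default) apply per OSD.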

Cost Analysis

Per-OSD Cost Comparison

Configuration	Drive Cost	SSD Cost	Total per OSD
Hybrid (USB + SSD WAL/DB)	$100-129 (5TB USB)	~$17-20 (1/3 of MX500)	$117-149
All-SSD	$400-600 (5TB enterprise SSD)	included	$400-600
USB only (no SSD)	$100-129 (5TB USB)	$0	$100-129

Full Cluster Cost

For the 15-OSD AlteredCarbon cluster:

Configuration	Total Cost	Performance
Hybrid (actual)	~$1,800-2,200	Excellent for homelab
All-SSD equivalent	~$6,000-9,000	Overkill for USB bandwidth
USB only	~$1,500-1,900	Sluggish metadata

The hybrid approach adds ~$300 (5x MX500 SSDs) to the USB-only cost and delivers the majority of the SSD performance benefit. The all-SSD option would be wasted on USB 3.0 bandwidth: you’d pay roughly 3-4x more and still be bottlenecked by the bus.
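
The cluster totals follow directly from the per-OSD prices in the first table. A small sketch that reproduces them for 15 OSDs (cluster_cost is a hypothetical helper; the prices are the ones quoted above, before rounding):

```shell
# cluster_cost OSDS LOW HIGH: per-OSD price range times the number of OSDs.
cluster_cost() {
  printf '$%d-$%d\n' "$(( $1 * $2 ))" "$(( $1 * $3 ))"
}

echo "Hybrid:   $(cluster_cost 15 117 149)"
echo "All-SSD:  $(cluster_cost 15 400 600)"
echo "USB only: $(cluster_cost 15 100 129)"
```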

When the Math Changes

The hybrid approach stops making sense when:

You upgrade to 10GbE networking — the USB drives become the clear bottleneck, and all-SSD or NVMe becomes worth the cost
You switch to SATA-connected HDDs — internal SATA is fast enough that WAL/DB separation provides less relative improvement
You need consistent low latency — databases and high-IOPS workloads need all-SSD regardless of cost

Operational Considerations

SSD Wear Monitoring

The MX500 SSDs handle constant WAL/DB writes. Monitor wear with:

smartctl -A /dev/sdc | grep -E "Percent_Lifetime_Remain|Total_LBAs_Written|Wear_Leveling|Media_Wearout|Available_Reservd"

At 100GB per OSD with 3 OSDs per SSD, the write amplification is modest. My MX500s show minimal wear after months of operation. Budget for replacement every 3-5 years depending on write volume.
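
To put a number on "minimal wear", the lifetime write counter can be compared against the MX500's 180 TBW rating. A sketch under two assumptions: the drive reports a Total_LBAs_Written attribute, and each LBA is 512 bytes (both worth confirming against your own smartctl output; the function names are mine):

```shell
# tb_written LBAS: convert an LBA count to terabytes, assuming 512-byte LBAs.
tb_written() {
  awk -v l="$1" 'BEGIN { printf "%.1f\n", l * 512 / 1e12 }'
}

# pct_of_tbw LBAS RATED_TBW: share of the rated endurance consumed, in percent.
pct_of_tbw() {
  awk -v l="$1" -v t="$2" 'BEGIN { printf "%.1f\n", l * 512 / 1e12 / t * 100 }'
}

# Example:
# lbas=$(smartctl -A /dev/sdc | awk '$2 == "Total_LBAs_Written" {print $NF}')
# echo "$(tb_written "$lbas") TB written, $(pct_of_tbw "$lbas" 180)% of rated TBW"
```

Tracking that percentage month over month gives a better replacement estimate than a fixed 3-5 year budget.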

USB Drive Health

USB Ceph drives fail in annoying ways — hung USB bridges that block every command, drives that disappear from the bus entirely, and enclosures that enumerate for half a second then disconnect.

Essential monitoring:

smartctl -H /dev/sdd -d sat,12

See Enabling SMART Monitoring on Seagate USB Drives and USB Drive SMART Updates for the full setup including USB quirks and GRUB configuration.

Label your drives. When a drive fails, knowing which physical USB cable corresponds to which OSD saves significant debugging time.

The Single SSD Risk

The biggest risk of this architecture: one SSD failure takes out three OSDs simultaneously. Every OSD on that node loses its WAL/DB device and goes down at once. On a node with three 5TB OSDs, that’s ~13.6 TiB of raw capacity disappearing in an instant.

We mitigate this with cluster design. The AlteredCarbon cluster has 5 OSD hosts with 3 OSDs each. Ceph’s CRUSH map distributes replicas across hosts, so losing an entire node (all 3 OSDs) still leaves 2 copies of every object on the remaining 4 hosts. The cluster continues serving data in a degraded state while you replace the SSD.

This is the same failure domain as losing the node itself — a power supply failure, motherboard death, or even accidentally unplugging the wrong power cable would have the same effect. The SSD doesn’t make the failure worse, it just adds another component that can trigger it.

If you only have 3 OSD hosts, this risk is more serious. Losing one node’s SSD means losing one-third of your OSDs, and with 3x replication there is no spare host for Ceph to re-replicate onto, so every PG stays degraded until the node is rebuilt. With min_size=2 you’d be one more failure away from data unavailability. Four or more OSD hosts is the minimum I’d recommend for this architecture.
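
The four-host minimum comes from simple counting: after one host is lost, Ceph can only restore full redundancy if the surviving hosts still number at least the pool's replica count. A sketch of that check, assuming a host-level failure domain (can_reheal is an illustrative name, and it ignores whether the survivors have enough free space):

```shell
# can_reheal HOSTS SIZE: after losing one host, are there still at least SIZE
# hosts left to hold one replica each?
can_reheal() {
  if [ $(( $1 - 1 )) -ge "$2" ]; then
    echo "yes"
  else
    echo "no"
  fi
}

for hosts in 3 4 5; do
  echo "$hosts hosts, size=3: re-replicates after a host loss? $(can_reheal "$hosts" 3)"
done
```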

SSD Failure Recovery

If the WAL/DB SSD fails, the affected OSDs will go down. Recovery depends on the failure mode:

SSD dies, data drives intact:

# The OSDs cannot be repaired in place: the RocksDB metadata (or WAL) that
# BlueStore needs to read the USB drive's data was on the failed SSD
systemctl stop ceph-osd@0 ceph-osd@3 ceph-osd@6
# Replace SSD, create new LVs
# Destroy and recreate each OSD on its wiped USB drive with a new WAL/DB LV
# Ceph then backfills the OSDs from the replicas on the other hosts
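
The per-OSD rebuild can be spelled out as a command list. The sketch below only prints the commands so they can be reviewed first. It assumes a 100G DB LV named like the ones in Method 2, and the VG name is a placeholder you would replace with the VG on the new SSD (rebuild_plan itself is a made-up helper, not a Ceph tool):

```shell
# rebuild_plan OSD_ID DATA_DEV VG: print (not run) the commands to recreate
# one OSD after its DB device was lost. Review each line before running it.
rebuild_plan() {
  osd=$1; dev=$2; vg=$3
  # Mark the OSD destroyed but keep its ID reserved for reuse
  echo "ceph osd destroy $osd --yes-i-really-mean-it"
  # Wipe the old BlueStore data and LVM metadata from the USB drive
  echo "ceph-volume lvm zap $dev --destroy"
  # New DB LV on the replacement SSD, then recreate the OSD with the same ID
  echo "lvcreate -L 100G -n osd-db-osd$osd $vg"
  echo "ceph-volume lvm create --osd-id $osd --data $dev --block.db $vg/osd-db-osd$osd"
}

# Example with placeholder names:
# rebuild_plan 6 /dev/sdd ceph-NEW-VG
```

Running it once per affected OSD gives a checklist for the maintenance window; the data itself comes back through backfill from the other hosts.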

USB drive dies, SSD intact:

The WAL/DB on the SSD is useless without the data drive. Remove the dead OSD, clean up the orphaned LV on the SSD, and add a replacement drive. This is exactly what I did when osd.6’s Seagate BUP Portable died.

Rebalancing with Hybrid OSDs

During Ceph rebalancing, the SSD acceleration helps significantly. Metadata operations that drive the rebalancing process (peering, PG migration decisions) run at SSD speed, even though the actual data movement is limited by USB bandwidth. This means peering completes and backfill starts sooner, even if the bulk data transfer rate is the same.

Lessons Learned

Use DB, not WAL-only. The Proxmox UI defaults to DB for good reason. It accelerates both writes and metadata reads for the same SSD space.
100GB per OSD is a good starting point. For 5TB data drives with typical homelab workloads (media, backups, VMs), 100GB DB partitions haven’t spilled to the HDD.
One SSD per node, partitioned for all OSDs. A single 500GB MX500 handles three OSDs comfortably. No need for one SSD per OSD.
The SSD doesn’t fix the USB bottleneck. Large sequential I/O is still limited by USB 3.0. The SSD fixes the metadata bottleneck, which is what makes the cluster feel responsive.
Label everything. Physical drive labels with OSD numbers save hours during failures. I learned this the hard way during the April 2026 osd.6 replacement.
Monitor SSD wear and USB health separately. They fail in different ways — SSDs wear out gradually (SMART attributes), USB drives fail suddenly (bus disconnects, bridge hangs).
Document your creation method. Whether you used the Proxmox UI (DB) or CLI (WAL), knowing which method created each OSD explains configuration differences when you’re debugging at 2 AM.

When ZFS and Ceph Problems Collide — Discovering the WAL vs DB inconsistency across the cluster
Ceph OSD Recovery After Power Failure — Recovering from cascading OSD failures on the same hardware
Proxmox Ceph Settings for the Homelab — Tuning scrub intervals and pool settings for homelab hardware
Optimizing Ceph Performance in Proxmox — mClock tuning and IOPS optimization
Enabling SMART Monitoring on Seagate USB Drives — Getting health data from USB Ceph drives
USB Drive SMART Updates — Updated SMART configuration for new drive models
Homelab Storage Economics: Ceph vs Single Drive — Cost analysis of distributed storage
Managing Ceph Nearfull Warnings — Capacity management for the cluster
Proxmox & Ceph Homelab Guide — All Proxmox and Ceph articles

References

Ceph BlueStore Configuration Reference — Official WAL/DB documentation
Florian’s Ceph Calculator — Planning OSD layouts and capacity
Proxmox Ceph Documentation — Proxmox-specific Ceph guidance
Ceph OSD SSD Acceleration Reference — Full OSD-to-device mapping for the AlteredCarbon cluster

About the Author: Michael McGarrah is a Cloud Architect with 25+ years in enterprise infrastructure, machine learning, and system administration. He holds an M.S. in Computer Science (AI/ML) from Georgia Tech and a B.S. in Computer Science from NC State University, and is currently pursuing an Executive MBA at UNC Wilmington.