Dealing with dead disks in a btrfs RAID1 array
24 Feb 2019

tl;dr: Check your disk usage vs. your RAID capacity to make sure the remaining disks can hold everything before trying to remove a disk. If you can connect a new disk without removing the old one, run a btrfs replace instead - it is much faster.
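For reference, this is roughly what a live replace looks like (a sketch, assuming the dying disk is /dev/sdb, the new disk is /dev/sde, and the array is mounted at /mnt/xwing - adjust for your own setup):

# copy data from the old device onto the new one while the array stays online;
# unreadable blocks are rebuilt from the other mirror
btrfs replace start /dev/sdb /dev/sde /mnt/xwing

# check progress at any time
btrfs replace status /mnt/xwing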
My home server has a 4-disk setup:
- A 128GB Samsung 850 EVO SSD as the primary disk (root volume)
- A 3-disk btrfs RAID1 array that I use for almost everything else.
The 3 disks were:
- A 3.5-inch 3TB WD drive that I shucked from a WD My Book. This was the oldest disk in the array.
- 2x 2.5-inch 3TB Seagate drives that I shucked from Seagate Expansion external disks.
The WD disk had been throwing an increasing number of errors recently, and I was noticing hangs across the system as well:
- My Steam saves would take a long time and hang the game.
- Kodi would occasionally hang when switching between screens as it loaded images from disk.
- Gitea, which writes a lot to disk, would run into similar issues.
I asked a question on r/archlinux and confirmed that it was indeed a dying disk.
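If you're debugging a similar situation, the kernel log and SMART data are a quick way to sanity-check a suspect drive (illustrative commands, not a transcript of what I ran; /dev/sdb is the suspect disk here):

# look for ATA/IO errors from the suspect drive in the kernel log
dmesg | grep -i sdb

# dump SMART health and attributes; watch for reallocated or pending sectors
smartctl -a /dev/sdb

# btrfs also keeps per-device read/write/corruption error counters
btrfs device stats /mnt/xwing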
I ordered a new Seagate Barracuda 3TB the next day, but my peculiar setup caused me a lot of pain before I could remove the dead disk. The primary issue was the limited number of SATA connectors I had (just 4). The original setup had /dev/sdb, /dev/sdc and /dev/sdd as the three RAID disks, with /dev/sdb being the dying WD.
Here is everything I tried:
- Removing /dev/sdb and adding a new disk (/dev/sde) to the array. Unfortunately, to add a disk to the array you have to mount the array first, and the setup just refused to mount in degraded mode. (It didn't give a visible error, so I didn't know why.) See the sketch after this list for what those commands look like.
- Keeping the old disk attached over USB, on a friend's suggestion. That didn't work either; it was likely a cable issue, and I didn't investigate further.
- Booting with the original three disks and then swapping the dying disk for the new one post-boot. This didn't work either: I kept getting read/write errors for /dev/sdb even after it was disconnected.
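For context, this is roughly what that first approach looks like on paper (a sketch assuming the array is mounted at /mnt/xwing, /dev/sdc is a surviving member, and /dev/sde is the new disk; in my case the degraded mount itself failed):

# mount the array in degraded mode with one member missing
mount -o degraded /dev/sdc /mnt/xwing

# add the new disk, then drop the missing device so data is re-replicated
btrfs device add /dev/sde /mnt/xwing
btrfs device delete missing /mnt/xwing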
In short:
- The system refused to mount the RAID array with a missing disk (and I didn't want to risk a boot with the array unavailable).
- I couldn't do a live replace because of the limited number of SATA connectors.
What worked:
Running a btrfs device delete and letting it run overnight. It gave an error after quite a long time that finally helped me figure out the problem:
btrfs device delete /dev/sdb1 /mnt/xwing
ERROR: error removing device '/dev/sdb1': No space left on device
btrfs fi df /mnt/xwing
Data, RAID1: total=2.98TiB, used=2.98TiB
System, RAID1: total=32.00MiB, used=544.00KiB
Metadata, RAID1: total=5.49GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
The RAID array was 3x 2.7TiB disks and I was storing roughly 2.98TiB of data. With only 2 disks, a RAID1 array can hold at most one disk's worth of data (~2.7TiB), so to shrink the array I needed to delete some data first. I ended up clearing out a few Steam games (bye bye Witcher 3) and ran another btrfs device delete, which resolved the issue.
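A quick way to check this up front (a sketch; /mnt/xwing is my mount point) is to compare the amount of data actually used against the capacity of the disks that will remain:

# shows overall used vs. free space, the data ratio, and per-device allocation;
# with RAID1 on 2x 2.7TiB disks, usable capacity is ~2.7TiB
btrfs filesystem usage /mnt/xwing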
If you are faced with a situation where you have to remove a device but can't do a live replace, here's what you need to do (the commands are sketched below):
- Check that removing the disk does not put any data at risk: your n-1 disk array should have enough capacity to store everything.
- Run a btrfs device delete.
- Reboot.
- Re-attach the new disk, and then run a btrfs device add.
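Putting it together, the sequence looks roughly like this (a sketch assuming /dev/sdb1 is on the dying disk, /dev/sde is its replacement, and the array is mounted at /mnt/xwing; the final balance is an extra step that is commonly recommended, not something from my original notes):

# move all data off the dying disk and drop it from the array (this is slow)
btrfs device delete /dev/sdb1 /mnt/xwing

# power down, swap the disks, boot, then add the new disk to the array
btrfs device add /dev/sde /mnt/xwing

# redistribute existing chunks so the new disk actually holds a copy of the data
btrfs balance start /mnt/xwing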
As a retrospective, I posted a summary of the issues I faced to the btrfs mailing list.
If you’re interested in my self-hosting setup: I’m using Terraform + Docker, the code is hosted on the same server, and I’ve been writing about my experience and learnings:
- Part 1, Hardware
- Part 2, Terraform/Docker
- Part 3, Learnings
- Part 4, Migrating from Google (and more)
- Part 5, Home Server Networking
- Part 6, btrfs RAID device replacement
If you have any comments, reach out to me.