Power outage, multiple problems

March 26, 2023

Hi all. Just suffered a power outage at my house, no UPS. (Learning the hard way... one has since been ordered)

When the server came back up I found the following:

1. Disk 1 was missing (wouldn't even show up under unassigned devices)

2. Cache pool showed uncorrectable errors during a scrub

Strangely, even though the server had shut down uncleanly, no parity check started automatically. Maybe because Disk 1 was missing?

Anyway, I ran the parity check, and when I checked back in after 20 minutes Disk 1 was sitting in the UD (unassigned devices) section. I guess it needed time to wake back up after getting its power cut. All the SMART attributes looked fine, but to make sure it won't die on me during the rebuild I ran a short SMART test (passed) and a long SMART test (still running, 20% left at this point).

Now, since the server is configured with dual parity, I'm not worried about data loss on the array. What I am worried about is the cache pool, since that holds the appdata. I do keep daily backups with the backup plugin, so the worst case is that I lose less than a day's worth of changes. But I wanted to see if I can just use the current data. However, the scrub command won't tell me exactly which files are affected.

In the logs I see the following:

[42493.446568] BTRFS error (device dm-3): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 32, gen 0
[42493.446579] BTRFS error (device dm-3): unable to fixup (regular) error at logical 82579787776 on dev /dev/mapper/nvme0n1p1
[42493.929861] BTRFS error (device dm-3): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 33, gen 0
[42493.929871] BTRFS error (device dm-3): unable to fixup (regular) error at logical 85866598400 on dev /dev/mapper/nvme0n1p1
[42496.080932] BTRFS error (device dm-3): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 34, gen 0
[42496.080942] BTRFS error (device dm-3): unable to fixup (regular) error at logical 100984004608 on dev /dev/mapper/nvme0n1p1
[42501.552701] BTRFS error (device dm-3): bdev /dev/mapper/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 35, gen 0
[42501.552709] BTRFS error (device dm-3): unable to fixup (regular) error at logical 136209657856 on dev /dev/mapper/nvme0n1p1

But when I run `btrfs inspect-internal logical-resolve <logical_address> /mnt/cache` it returns nothing for all four affected addresses.
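
For anyone who wants to avoid copying addresses by hand, here is a minimal sketch that pulls the logical addresses out of the "unable to fixup" lines and prints one `logical-resolve` command per address. It assumes the log format shown above and the `/mnt/cache` mount point; the function names are made up for illustration. It only builds the commands, it does not run them, and it won't explain why the resolve comes back empty.

```python
import re

# Matches the "unable to fixup" lines in the format shown in the log above.
FIXUP_RE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): unable to fixup \(regular\) error "
    r"at logical (?P<logical>\d+) on dev (?P<dev>\S+)"
)


def uncorrectable_addresses(log_text):
    """Return (logical_address, device) pairs in log order, without duplicates."""
    pairs = []
    for match in FIXUP_RE.finditer(log_text):
        pair = (int(match.group("logical")), match.group("dev"))
        if pair not in pairs:
            pairs.append(pair)
    return pairs


def resolve_commands(log_text, mount_point="/mnt/cache"):
    """Build one logical-resolve command string per affected address."""
    return [
        f"btrfs inspect-internal logical-resolve {address} {mount_point}"
        for address, _device in uncorrectable_addresses(log_text)
    ]


if __name__ == "__main__":
    # Two lines copied from the log above; paste the full dmesg/syslog text instead.
    sample = (
        "[42493.446579] BTRFS error (device dm-3): unable to fixup (regular) "
        "error at logical 82579787776 on dev /dev/mapper/nvme0n1p1\n"
        "[42493.929871] BTRFS error (device dm-3): unable to fixup (regular) "
        "error at logical 85866598400 on dev /dev/mapper/nvme0n1p1\n"
    )
    for command in resolve_commands(sample):
        print(command)
```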

BTRFS check (read-only) reports that there are no errors.

I've stopped the Docker service to prevent further damage. That was before all the checking and triage, so about 15-20 minutes after the server came up. I also took a backup (again, with the uncorrectable errors present, so I'm guessing some files in there are corrupt?).

At this point this is my plan for the day:

1. Wait for the parity check to complete (hopefully without any errors; if there are any, then it's another 14 hours' worth of parity checking with write corrections turned on. Is there a way to just fix the errors without rerunning the entire thing?)

2. Stop the array

3. Wipe cache pool (how? Can't find the format button... does it show up when the array is stopped?)

4. Make sure long SMART for dev1 passes

5. Reassign dev1 to Disk 1

6. Start array

7. While rebuilding, restore appdata backup

Is the plan okay or should I do something else? Anything I've missed? Feedback would be appreciated!

Attached diagnostics.

dipper-diagnostics-20230326-1022.zip

March 26, 2023

Author

So the parity check completed without errors, so I stopped the array to begin maintenance.

Wiping the cache pool was relatively simple. I was definitely overthinking things. For future reference: unassign the SSD from the pool, run `blkdiscard -f /dev/nvmeXnX` (substituting your own drive's device name; this discards everything on that device), then reassign it to the pool and erase.

Once I verified that the long SMART passed for dev1 I re-assigned the drive back to Disk 1 and started the array. Then I rebuilt the Docker images and started restoring the appdata.

Hopefully this helps someone else who has the same issue!

March 26, 2023

Author

I also had to change the configuration of some Docker apps, because the underlying Docker network changed from 172.18.0.x to 172.17.0.x. Something to keep in mind if some services are not resolving after the rebuild.
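
If you have a lot of app configs to go through, a rough sketch like the one below can flag lines that still point at the old range. It assumes the old network was 172.18.0.0/24 (taken literally from "172.18.0.x"; your Docker network may use a different prefix length), and the function name is hypothetical. It only reports matches and changes nothing.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; ipaddress does the real validation.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def stale_addresses(config_text, old_subnet="172.18.0.0/24"):
    """Return (line_number, address) for every IPv4 address inside old_subnet."""
    network = ipaddress.ip_network(old_subnet)
    hits = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        for token in IPV4_RE.findall(line):
            try:
                address = ipaddress.ip_address(token)
            except ValueError:
                continue  # looked like an IP but is not valid (e.g. 999.1.1.1)
            if address in network:
                hits.append((line_number, token))
    return hits


if __name__ == "__main__":
    # Made-up config text, purely for illustration.
    example = "db_host = 172.18.0.5\nproxy = 172.17.0.2\n"
    for line_number, address in stale_addresses(example):
        print(f"line {line_number}: {address} is in the old range")
```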
