
Drive replacement - unmountable file system



Posted

I'm sure there are too many of these posts to count, but I am having this issue after replacing a drive with errors.  I replaced disk2 on my array after seeing some errors and the array not starting.  I had a replacement drive ready and used the typical process here: https://wiki.unraid.net/Replacing_a_Data_Drive to do so.  After the rebuild completed, the drive was not mountable, so I thought I could scan and repair the drive with the array started since it is btrfs, but the options are greyed out.  I tried from the terminal (roughly the commands shown below) and it also threw an error.  I really didn't have a lot of data on that drive that I care about, but I can't even start the array and would really like to recover if possible.  Attached diagnostics.

 

Any suggestions to get things back to normal without loss?  My parity and other disks never ran into any issues, so I should be able to recover, no?
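For anyone who wants specifics, this is roughly the kind of terminal check I mean; /dev/md2 is only my assumption for disk2 and /x is a throwaway mount point, so treat this as a sketch rather than exact commands to copy:

# read-only filesystem check, makes no changes to the disk
btrfs check --readonly /dev/md2

# read-only mount attempt using a backup tree root, to see if the data is reachable at all
mkdir -p /x
mount -o ro,usebackuproot /dev/md2 /x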

tower-diagnostics-20200519-1940.zip

Posted

I went through all of those suggestions and am pretty bummed out that every attempt to read the filesystem errored, as did every repair attempt.  I guess I expected this to be a non-issue and Unraid to handle a single drive failure much better than I am seeing.  I would still really like to recover, and I just noticed that my backup got overwritten last night!

I wasn't holding much on that drive but had some Unraid system files there, just not sure which ones were pointed to that disk.

The diagnostics attached above are from before I lost my backup to last night's overwrite.

Anyone have any suggestions on recovering OR rebuilding?  I still have my VM images, and system\docker and system\libvirt, but I'm not sure what else I'm missing.  Can someone help?

 

Also, is there any way to better prepare for a single drive failure: dual parity, or some better way to configure my disks to prevent a single disk from taking down my dockers and VMs? Shouldn't the contents of the disk have been emulated when the disk failed?

Posted
2 minutes ago, johnnie.black said:

Drive failure is not the same as filesystem corruption; Unraid can only help with the former, though a failing drive can sometimes corrupt the filesystem.

I guess I must be in that fun category then, Johnnie!  My disk was showing errors, so I went through the recommendations on the forum/support FAQs to replace the device that had errors.  Should I have gone through some other process before I replaced the drive, or after?  Shouldn't Unraid have been able to rebuild that drive since parity and all other disks were good?  Are we saying that the filesystem corruption had gone on for so long that the rebuild just rebuilt garbage back onto the drive?

 

What other things should I be doing to ensure a failure like this doesn't occur again?  Am I really SOL on getting that disk's data back? Is there some other filesystem I should consider that makes things more reliable or easier to recover?

Posted

Then that suggests parity wasn't 100% valid, or something else happened before the replacement. The diags only show what happened after you replaced the disk; did you save the old ones by chance?

Posted

I don't think so, I'll check.  Sounds like I really need to understand more about how things are being written: what happens if a disk starts erroring while a parity sync is going?  Does that totally negate my parity, or just fail the sync?  Is there something I am not doing that I should be?

Posted

Ok Johnnie, I'll assume that had something to do with it.  Here are the questions I have at this point:

1. At what point does a disk failure take all the data on that disk with it?

2. Should I be preventing parity from syncing if I see errors?

3. Will a reboot or stopping/starting the array make things worse?

4. If I did see the errors- and wanted to MAKE ABSOLUTELY sure that I could recover- what are the best practices?

5. Is there anything else I can do to make sure I have enough redundancy to recover if I do notice errors on one of my disks?

 

Sorry for all the questions, but I really thought I had a handle on this process until my disk failure made everything melt down.  I have backups of the really important stuff, but I thought the array was more resilient to a single disk failure.

 

I guess the last question would be: is there a way to get back to just having errors on one drive, so I can go through the process from the point where those errors showed up? Or did replacing that drive and attempting a rebuild ruin any chance I had of getting things back to that error state?

Posted

Some questions are difficult to answer since it depends on the exact circumstances, but I can answer some:

10 hours ago, jordanmw said:

Should I be preventing parity from syncing if I see errors?

Yes, parity checks should always be non-correcting unless sync errors are expected, for example after an unclean shutdown.

 

10 hours ago, jordanmw said:

Will a reboot or stopping/starting the array make things worse?

Not usually, but always download the diags before rebooting if there were issues/errors.

 

10 hours ago, jordanmw said:

If I did see the errors- and wanted to MAKE ABSOLUTELY sure that I could recover- what are the best practices?

If you have doubts, download the diags and ask for help in the forum before doing anything else.
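If the GUI isn't reachable, they can also be grabbed from the terminal; on current Unraid releases the command below writes the same zip the Tools -> Diagnostics page produces to the flash drive (the exact output path may vary by version, so treat that as an assumption):

# collect the diagnostics zip and save it to the flash drive (typically /boot/logs)
diagnostics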

 

10 hours ago, jordanmw said:

Is there anything else I can do to make sure I have enough redundancy to recover if I do notice errors on one of my disks?

Unraid is never a backup. It does give you some redundancy, but many other things can happen (user error, ransomware, fs corruption, etc.), so you need to have backups of anything important or irreplaceable.
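As a rough example of the kind of thing that would have helped here, even a simple scheduled rsync of the VM metadata to a target that isn't on a single array disk goes a long way; the destination below is made up, adjust it to wherever your backups actually live:

# copy the libvirt image and the VM definitions somewhere off the array disks (example paths only)
rsync -a /mnt/user/system/libvirt/libvirt.img /mnt/disks/backup/system/
rsync -a /etc/libvirt/qemu/ /mnt/disks/backup/vm-xml/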

 

 

 

Posted

I do have CA Backup/Restore, with several restore points from the last few weeks.  The only thing that I lost is the libvirt.img that somehow got put on the disk that went down.  Any way to recover that or recreate it easily?
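If it comes to it, I assume I can recreate a blank libvirt.img from Settings > VM Manager and then re-define each VM from my saved XML backups, something like this (the file path is only an example of where a backup might sit):

# re-create a VM definition from a saved XML backup
virsh define /boot/vm-xml-backups/gaming-vm1.xml

Please correct me if that's the wrong approach.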

Posted
On 5/20/2020 at 11:44 AM, johnnie.black said:

The problem likely resulted from this, but can't say more without diags.

Here are a few of the last diags before the array went offline.

 

I realize now that I am missing my iso share and whatever else from /mnt/user was stored on that disk, which means there are probably more things missing than I thought.  Any other hopeful methods to recover or rebuild? I do have the xml config backups for my VMs also.

download_2020-05-21_13-44-17.zip
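One thing I'm wondering about: would something like btrfs restore, pointed at the unmountable disk (or at the old pulled drive attached elsewhere), be worth trying to copy off whatever is still readable? Roughly along these lines, with the device and destination only placeholders:

# copy whatever files btrfs can still read from the damaged filesystem to a good disk; makes no changes to the source
mkdir -p /mnt/disk3/restored_disk2
btrfs restore -v /dev/md2 /mnt/disk3/restored_disk2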

Posted

Got 2 of the 4 VMs back up by grabbing some old configs and img files, pulling down some isos, and spending a few minutes searching the forums when I hit errors. I guess I need to refine my backup/restore process since this is the second time I've had to perform it.  My biggest fault was not noticing one of my backup locations was assigned to the array disk that failed.  I guess it screwed me up that there were 3 different locations that had to be empty.  Rookie mistake I guess, but I thought I had some ability to recover with a new disk as long as I caught it quickly.  Even had a spare ready to drop in, and somehow I still lost a couple days of uptime. It might have been a bigger deal if these were anything but gaming machines and servers. My kids might disagree, but they're lucky to have it at all.

Posted

The diags you posted show where the problem began: disk1 was having ATA and read errors left and right, and it ended up corrupting the filesystem even without being disabled by Unraid:

 

May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 Sense Key : 0x5 [current]
May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 ASC=0x21 ASCQ=0x0
May 18 06:13:25 Tower kernel: sd 1:0:0:0: [sdb] tag#24 CDB: opcode=0x88 88 00 00 00 00 00 25 6a da 20 00 00 00 20 00 00
May 18 06:13:25 Tower kernel: print_req_error: I/O error, dev sdb, sector 627759648
May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759584
May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759592
May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759600
May 18 06:13:25 Tower kernel: md: disk1 read error, sector=627759608
May 18 06:13:25 Tower kernel: ata1: EH complete
May 18 06:13:25 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424161792 wanted 869496 found 844525
May 18 06:13:25 Tower kernel: loop: Write error at byte offset 67108864, length 4096.
May 18 06:13:25 Tower kernel: print_req_error: I/O error, dev loop2, sector 131072
May 18 06:13:25 Tower kernel: BTRFS warning (device loop2): lost page write due to IO error on /dev/loop2
May 18 06:13:25 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
May 18 06:13:25 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424129024 wanted 869496 found 864712
May 18 06:14:16 Tower kernel: BTRFS error (device md1): parent transid verify failed on 379424718848 wanted 869496 found 864713
May 18 06:14:45 Tower kernel: ata1.00: exception Emask 0x0 SAct 0xc00000 SErr 0x0 action 0x0
May 18 06:14:45 Tower kernel: ata1.00: irq_stat 0x40000008
May 18 06:14:45 Tower kernel: ata1.00: failed command: READ FPDMA QUEUED
May 18 06:14:45 Tower kernel: ata1.00: cmd 60/08:b0:88:cd:97/00:00:05:00:00/40 tag 22 ncq dma 4096 in
May 18 06:14:45 Tower kernel:         res 41/10:00:88:cd:97/00:00:05:00:00/40 Emask 0x481 (invalid argument) <F>
May 18 06:14:45 Tower kernel: ata1.00: status: { DRDY ERR }

So in this case Unraid couldn't help you, since the fs corruption happened before the disk got disabled/replaced, and it also wasn't anything you did wrong. In theory the filesystem should survive these errors, but unfortunately that's not always the case; it can also be a problem with the disk's firmware. In any case, only backups could have saved you here.

 

Also, for the typical user it's best to stick with XFS since it's usually more resilient, though not incorruptible.
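And whichever filesystem you use, get in the habit of running a read-only check first and only repairing after asking here. For XFS that means starting the array in maintenance mode and running a no-modify check, something like this (the md device is an example, use the one matching the disk being checked):

# no-modify check: reports problems but changes nothing on the disk
xfs_repair -n /dev/md2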

