Whaler_99 Posted November 7, 2021

Hopefully someone can help with this one. The system has been fine for months, but this morning I noticed Plex wasn't up. Logged in, and two drives were red-balled, both showing the exact same count of errors, which is weird. Looks like around 11:38 last night things went nuts: tons of disk errors and such. It's strange, though, that two drives would fail at the same time with the same number of errors. The array is still up since I have dual parity, but another weird one: Docker won't start, like it thinks it's all gone as well. Also, if I go to the VM tab, it just says no VMs installed... and I did have one VM previously running. I've attached the diagnostics zip and a screenshot of the Docker screen. I'm assuming it's either something with the controller or the cables on those two drives, or maybe even the backplane on the 3x5 cage, but I've never seen an issue with that.

Diag.zip
itimpi Posted November 7, 2021

It looks as if a cache drive, disk14 and disk15 all dropped offline at about the same time. Is there anything they share (cabling, disk controller) that needs checking? As a result there is no SMART information for those drives in the diagnostics. I suggest you power-cycle the server to see if those drives come back online, and then post new diagnostics so we can see their state.
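One way to check what the dropped drives share, assuming SSH access to the server (a hypothetical sketch; the by-path entries vary by board): drives whose paths show the same PCI address hang off the same controller, which makes that controller a common failure point.

```shell
# Hypothetical sketch: map each disk to its controller port.
# Entries sharing the same PCI address (e.g. pci-0000:02:00.0)
# are on the same controller.
if [ -d /dev/disk/by-path ]; then
    # skip the per-partition symlinks, keep whole-disk entries
    ls -l /dev/disk/by-path | grep -v -- '-part' || true
else
    echo "no /dev/disk/by-path on this system"
fi
```

Onboard SATA ports usually show up under the chipset's PCI address, so motherboard-attached drives tend to cluster under one entry.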
Whaler_99 Posted November 7, 2021

1 hour ago, itimpi said:

It looks as if a cache drive, disk14 and disk15 all dropped offline at about the same time. Is there anything they share (cabling, disk controller) that needs checking. As a result there is no SMART information for those drives in the diagnostics. I suggest you power-cycle the server to see if those drives come back online and then post new diagnostics so we can see their state.

Thanks! I did notice the cache drive looks fine (two SSDs) and is available via \\tower\cache, but yes, going to power down and check connections, cards, etc. I'll see if anything is common between 14 and 15, and will also check the SSD cache drives. Maybe they are all on the motherboard...
Whaler_99 Posted November 7, 2021

Cannot shut down the server; it's stuck on "Retry unmounting disk share(s)". I checked the Open Files plugin; one PID seemed to be stuck, looked related to my VM (the img file was held open), but the plugin wouldn't kill it. So I SSH'd in and killed that. Still nothing. The log is full of:

Nov 7 12:34:22 Tower emhttpd: Retry unmounting disk share(s)...
Nov 7 12:34:27 Tower emhttpd: Unmounting disks...
Nov 7 12:34:27 Tower emhttpd: shcmd (171808230): umount /mnt/cache
Nov 7 12:34:27 Tower root: umount: /mnt/cache: target is busy.
Nov 7 12:34:27 Tower emhttpd: shcmd (171808230): exit status: 32
Nov 7 12:34:27 Tower emhttpd: Retry unmounting disk share(s)...
Nov 7 12:34:32 Tower emhttpd: Unmounting disks...
Nov 7 12:34:32 Tower emhttpd: shcmd (171808231): umount /mnt/cache
Nov 7 12:34:32 Tower root: umount: /mnt/cache: target is busy.
Nov 7 12:34:32 Tower emhttpd: shcmd (171808231): exit status: 32
Nov 7 12:34:32 Tower emhttpd: Retry unmounting disk share(s)...
Nov 7 12:34:37 Tower emhttpd: Unmounting disks...

SSH'd in again and tried "umount /mnt/cache", but just get "target is busy". Not sure how to "kill" the cache to get the server to shut down gracefully...
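When umount keeps reporting "target is busy", something still has files open under the mount. A hedged sketch of how to find it from that SSH session (fuser and lsof availability depends on the build; /mnt/cache is the stuck mount from the log above, and the script is guarded so it is safe to run anywhere):

```shell
#!/bin/sh
# Sketch: identify what still holds /mnt/cache open before resorting
# to an unclean power-off.
MNT=/mnt/cache
if mountpoint -q "$MNT" 2>/dev/null; then
    # PIDs with open files, working directories, or mmaps under the mount
    if command -v fuser >/dev/null; then fuser -vm "$MNT"; fi
    # per-file detail for the same processes
    if command -v lsof >/dev/null; then lsof +f -- "$MNT"; fi
else
    echo "$MNT is not mounted on this machine"
fi
```

Once the holding processes are gone, `umount /mnt/cache` should succeed; `umount -l /mnt/cache` (lazy unmount) is a last resort that detaches the mount immediately and finishes the cleanup when it stops being busy.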
Whaler_99 Posted November 7, 2021

@itimpi System shut down. Yep, those two drives and both SSDs (cache) are attached to the motherboard ports. Weirdly, so are the parity drives, which seem fine. Also, all six of those drives are in different cages and on different power connections, so definitely something with the mobo last night...
Whaler_99 Posted November 7, 2021

@itimpi Powered back up; those two drives now show "Unmountable: not mounted" but everything else looks fine. Docker is back up, VM is back up, etc... Latest diagnostics attached.

tower-diagnostics-20211107-1338.zip
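With the drives back online, it is also worth confirming SMART directly before deciding how to rebuild. A minimal sketch, assuming smartmontools is available; /dev/sdb and /dev/sdc are placeholder device names, so match them against disks 14 and 15 on the Main page first:

```shell
#!/bin/sh
# Placeholder device names; substitute the real ones for disks 14/15.
for dev in /dev/sdb /dev/sdc; do
    if [ ! -e "$dev" ]; then
        echo "$dev not present on this machine"
    elif command -v smartctl >/dev/null; then
        smartctl -H "$dev"    # overall health verdict (PASSED/FAILED)
        # the attributes that most often precede real drive failure
        smartctl -A "$dev" | grep -Ei 'reallocated|pending|uncorrect' || true
    else
        echo "smartctl not installed (part of smartmontools)"
    fi
done
```

A clean health verdict here supports the view that the drives were dropped by the controller rather than failing themselves.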
trurl Posted November 7, 2021

Emulated disks 14,15 unmountable, SMART looks OK. You should try to repair the filesystem on the emulated disks, and then if they are mountable you can rebuild them.

https://wiki.unraid.net/Manual/Storage_Management#Checking_a_File_System
Whaler_99 Posted November 7, 2021

1 minute ago, trurl said:

Emulated disks 14,15 unmountable, SMART looks OK. You should try to repair the filesystem on the emulated disks, and then if they are mountable you can rebuild them.

https://wiki.unraid.net/Manual/Storage_Management#Running_the_Test_using_the_webGui

That is what I'm looking at now, thanks! Looks like there are issues...

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x439496, xfs_agf block 0x7ffffff9/0x200
agf has bad CRC for ag 1
Metadata CRC error detected at 0x43cfad, xfs_bnobt block 0x80000000/0x1000
btree block 1/1 is suspect, error -74
Metadata CRC error detected at 0x43cfad, xfs_cntbt block 0x80000008/0x1000
btree block 1/2 is suspect, error -74
out-of-order cnt btree record 25 (195207574 483) block 1/2
out-of-order cnt btree record 26 (48588613 610) block 1/2
out-of-order cnt btree record 27 (22270440 653) block 1/2
out-of-order cnt btree record 28 (212725104 734) block 1/2
out-of-order cnt btree record 29 (68889253 763) block 1/2
out-of-order cnt btree record 30 (69842275 976) block 1/2
out-of-order cnt btree record 31 (408138 1065) block 1/2
out-of-order cnt btree record 32 (75921534 1140) block 1/2
out-of-order cnt btree record 33 (11907778 1249) block 1/2
out-of-order cnt btree record 34 (65323505 1402) block 1/2
out-of-order cnt btree record 35 (17823671 1554) block 1/2
out-of-order cnt btree record 36 (13061415 1664) block 1/2
out-of-order cnt btree record 37 (410865 1803) block 1/2
out-of-order cnt btree record 38 (14734129 1971) block 1/2
out-of-order cnt btree record 39 (71076551 2176) block 1/2
out-of-order cnt btree record 40 (55598632 2674) block 1/2
out-of-order cnt btree record 41 (66915612 2723) block 1/2
out-of-order cnt btree record 42 (15412807 2926) block 1/2
out-of-order cnt btree record 43 (34874023 3272) block 1/2
out-of-order cnt btree record 44 (15434037 3499) block 1/2
out-of-order cnt btree record 45 (21234912 3545) block 1/2
out-of-order cnt btree record 46 (96350505 3640) block 1/2
out-of-order cnt btree record 47 (41596365 4210) block 1/2
out-of-order cnt btree record 48 (254222786 4228) block 1/2
out-of-order cnt btree record 49 (56382107 4640) block 1/2
out-of-order cnt btree record 50 (116355071 4842) block 1/2
out-of-order cnt btree record 51 (67565316 5035) block 1/2
out-of-order cnt btree record 52 (196228737 5072) block 1/2
out-of-order cnt btree record 53 (48635751 5179) block 1/2
out-of-order cnt btree record 54 (59835462 5193) block 1/2
out-of-order cnt btree record 55 (63738351 5424) block 1/2
out-of-order cnt btree record 56 (16719899 5864) block 1/2
out-of-order cnt btree record 57 (35530273 6147) block 1/2
out-of-order cnt btree record 58 (268428864 6591) block 1/2
out-of-order cnt btree record 59 (37604543 6854) block 1/2
out-of-order cnt btree record 60 (69367424 6866) block 1/2
out-of-order cnt btree record 61 (64078359 7016) block 1/2
out-of-order cnt btree record 62 (69657714 7222) block 1/2
out-of-order cnt btree record 63 (199624535 7261) block 1/2
out-of-order cnt btree record 64 (16698834 7369) block 1/2
out-of-order cnt btree record 65 (65772445 7645) block 1/2
out-of-order cnt btree record 66 (34775739 7789) block 1/2
out-of-order cnt btree record 67 (18257967 8398) block 1/2
out-of-order cnt btree record 68 (19068571 8580) block 1/2
out-of-order cnt btree record 69 (392018 8644) block 1/2
out-of-order cnt btree record 70 (49514736 8874) block 1/2
out-of-order cnt btree record 71 (98503461 9154) block 1/2
out-of-order cnt btree record 72 (64967088 9377) block 1/2
out-of-order cnt btree record 73 (11450231 9611) block 1/2
out-of-order cnt btree record 74 (266755081 10222) block 1/2
out-of-order cnt btree record 75 (210009117 12488) block 1/2
out-of-order cnt btree record 76 (54254618 12559) block 1/2
out-of-order cnt btree record 77 (66681910 12934) block 1/2
out-of-order cnt btree record 78 (122412185 13163) block 1/2
out-of-order cnt btree record 79 (194016293 15256) block 1/2
out-of-order cnt btree record 80 (66762172 15968) block 1/2
out-of-order cnt btree record 81 (56180134 16629) block 1/2
out-of-order cnt btree record 82 (194582706 16801) block 1/2
out-of-order cnt btree record 83 (226786945 17666) block 1/2
out-of-order cnt btree record 84 (66461810 17865) block 1/2
out-of-order cnt btree record 85 (200768603 17936) block 1/2
out-of-order cnt btree record 86 (186525262 19021) block 1/2
out-of-order cnt btree record 87 (224943003 22652) block 1/2
out-of-order cnt btree record 88 (65862874 23072) block 1/2
out-of-order cnt btree record 89 (64823156 24092) block 1/2
out-of-order cnt btree record 90 (199479370 25987) block 1/2
out-of-order cnt btree record 91 (223888721 33049) block 1/2
out-of-order cnt btree record 92 (225972644 41700) block 1/2
agf_freeblks 27127499, counted 27153867 in ag 1
sb_fdblocks 374735719, counted 374919015
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
data fork in ino 2150484118 claims free block 297875666
data fork in ino 2150770520 claims free block 273513335
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (1,11450231-11459841) only seen by one free space btree
free space (1,19068114-19068569) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 4
        - agno = 5
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
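For what it's worth, the trailing "No modify flag set" lines confirm this was still the read-only -n pass, so nothing has been written yet. A small illustrative sketch (the excerpt below is abridged from the output above) showing how to tally the error classes from a saved check log:

```shell
# Abridged excerpt of the -n (no-modify) check output above.
cat > /tmp/xfs_check.log <<'EOF'
Metadata CRC error detected at 0x439496, xfs_agf block 0x7ffffff9/0x200
agf has bad CRC for ag 1
btree block 1/2 is suspect, error -74
out-of-order cnt btree record 25 (195207574 483) block 1/2
out-of-order cnt btree record 26 (48588613 610) block 1/2
No modify flag set, skipping phase 5
EOF
grep -c 'out-of-order' /tmp/xfs_check.log   # prints 2: misordered btree records
grep -c 'CRC' /tmp/xfs_check.log            # prints 2: checksum errors
```

Here every error lives in the free-space btrees of allocation group 1, metadata that xfs_repair can rebuild without touching file data, which is why the repair below goes cleanly.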
I'm running this from the GUI. I assume I now remove the -n option and run the check again, and it should repair things?
trurl Posted November 7, 2021

1 minute ago, Whaler_99 said:

remove the -n option, run check again and it should repair things?

Yes. Post the results and new diagnostics. Depending on how well the repair goes, you might not want to rebuild to those same disks, so you can keep them as-is in case they are better than the emulated repair. Do you have spare disks for the rebuild?
Whaler_99 Posted November 7, 2021

3 minutes ago, trurl said:

yes. post the results and new diagnostics. Depending on how well repair goes you might not want to rebuild to those same disks so you can keep them as is in case they are better than the emulated repair. Do you have spare disks for rebuild?

Looks to have run OK. Here are the results from disk 15 after running the repair and then another check.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Also, here are updated diagnostics. Unfortunately I don't have extra 6TB drives; I have one 8TB as a spare for the parity disks.

tower-diagnostics-20211107-1409.zip
trurl Posted November 7, 2021

Sorry, I should have told you to start the array in normal mode before getting diagnostics.
Whaler_99 Posted November 7, 2021

Started up; both drives are red-balled (disabled, being emulated). So that's an improvement. New diagnostics attached. Thanks for the assistance @trurl!

tower-diagnostics-20211107-1417.zip
trurl Posted November 7, 2021

Both emulated disks mounted and seem to have plenty of data, and it doesn't look like anything is in lost+found. You can rebuild both at once since you have dual parity. It might be safer to rebuild one at a time to a spare, just to make sure your hardware problems don't come back. You can rebuild to a larger disk, as long as it isn't any larger than any parity disk.
trurl Posted November 7, 2021

I always like to ask. Do you have another copy of anything important and irreplaceable?
Whaler_99 Posted November 7, 2021

2 minutes ago, trurl said:

I always like to ask. Do you have another copy of anything important and irreplaceable?

Yep, I replicate the entire array to a secondary system, just in case.
trurl Posted November 7, 2021

30 minutes ago, Whaler_99 said:

Yep, I replicate the entire array to a secondary system, just in case.

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself
Whaler_99 Posted November 7, 2021

16 minutes ago, trurl said:

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

Thanks! That is what I am doing.
JorgeB Posted November 8, 2021

This was caused by a quite common controller issue with some Ryzen boards. Look for a BIOS update; it's been reported to help. If not, consider getting an add-on controller.
Whaler_99 Posted November 8, 2021

3 hours ago, JorgeB said:

This was caused by the quite common controller issue with some Ryzen boards, looks for a BIOS update, it's been reported to help, if not consider getting an add-on controller.

Thanks for the info! I'll get that done asap!
Whaler_99 Posted November 8, 2021

13 hours ago, JorgeB said:

This was caused by the quite common controller issue with some Ryzen boards, looks for a BIOS update, it's been reported to help, if not consider getting an add-on controller.

Yep, it was two years out of date; updated it to the latest. Drive rebuilds are all complete and successful. Everything is working OK again! Thanks @itimpi @trurl and @JorgeB