Two drives failed at the same time with the same error count, plus a Docker error - weird



Hopefully someone can help with this one. The system has been fine for months, but this morning I noticed Plex wasn't up. I logged in and found two drives red-balled, both showing exactly the same error count, which is weird.

It looks like things went nuts around 11:38 last night - tons of disk errors and such. It's strange, though, that two drives failed at the same time with the same number of errors. The array is still up since I have dual parity, but here's another odd one: Docker won't start, as if it thinks everything is gone as well.

Also, if I go to the VM tab, it just says no VMs are installed... and I do have one VM that was previously running.

 

I've attached the diagnostics zip and a screenshot of the Docker screen. I'm assuming it's either the controller or the cables on those two drives, or maybe even the backplane on the 3x5 cage, but I've never seen an issue with that before.

 

docker_error.jpg

Diag.zip

Edited by Whaler_99
Link to comment

It looks as if a cache drive, disk14 and disk15 all dropped offline at about the same time. Is there anything they share (cabling, disk controller) that needs checking? Because they dropped offline, there is no SMART information for those drives in the diagnostics.

 

I suggest you power-cycle the server to see if those drives come back online and then post new diagnostics so we can see their state.
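
If the webGUI is unresponsive after the reboot, the diagnostics can also be collected from a console/SSH session. A minimal sketch, assuming a stock Unraid install where the zip is written to /boot/logs:

# generate a fresh diagnostics zip from the command line
diagnostics
# confirm where it landed so it can be copied off and attached here
ls -lt /boot/logs | head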

Link to comment
1 hour ago, itimpi said:

It looks as if a cache drive, disk14 and disk15 all dropped offline at about the same time. Is there anything they share (cabling, disk controller) that needs checking? Because they dropped offline, there is no SMART information for those drives in the diagnostics.

 

I suggest you power-cycle the server to see if those drives come back online and then post new diagnostics so we can see their state.

 

Thanks! I did notice the cache pool looks fine - two SSDs, still available via \\tower\cache - but yes, I'm going to power down and check connections, cards, etc., and see if there's anything common between 14 and 15. I'll also check the SSD cache drives; maybe they are all on the motherboard...
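
A rough sketch of one way to see what those drives share, assuming a standard Unraid/Linux shell (the output just needs to be matched against the sdX assignments shown on the Main page):

# each link name in /dev/disk/by-path encodes the PCI address of the controller
# the drive sits behind, so devices that dropped together often share a prefix
ls -l /dev/disk/by-path/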

 

 

Link to comment

I can't shut down the server; it's stuck on "Retry unmounting disk share(s)".

I checked the Open Files plugin and one PID seemed to be stuck - it looked related to my VM, with its img file held open - but the plugin wouldn't kill it. So I SSH'd in and killed it manually. Still nothing. The log is full of:

Nov  7 12:34:22 Tower emhttpd: Retry unmounting disk share(s)...
Nov  7 12:34:27 Tower emhttpd: Unmounting disks...
Nov  7 12:34:27 Tower emhttpd: shcmd (171808230): umount /mnt/cache
Nov  7 12:34:27 Tower root: umount: /mnt/cache: target is busy.
Nov  7 12:34:27 Tower emhttpd: shcmd (171808230): exit status: 32
Nov  7 12:34:27 Tower emhttpd: Retry unmounting disk share(s)...
Nov  7 12:34:32 Tower emhttpd: Unmounting disks...
Nov  7 12:34:32 Tower emhttpd: shcmd (171808231): umount /mnt/cache
Nov  7 12:34:32 Tower root: umount: /mnt/cache: target is busy.
Nov  7 12:34:32 Tower emhttpd: shcmd (171808231): exit status: 32
Nov  7 12:34:32 Tower emhttpd: Retry unmounting disk share(s)...
Nov  7 12:34:37 Tower emhttpd: Unmounting disks...

SSH'd in again... tried "umount /mnt/cache" and just get "target is busy".

 

Not sure how to "kill" whatever is holding the cache so the server can shut down gracefully...
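
A hedged sketch of what should show the culprit, assuming fuser and lsof are present on the box and that rc.docker/rc.libvirt are the usual Unraid service scripts:

# list processes with files open under the cache mount
fuser -vm /mnt/cache
lsof +D /mnt/cache
# if the docker.img or libvirt.img loop mounts are what's busy, stopping the
# services is cleaner than killing PIDs
/etc/rc.d/rc.docker stop
/etc/rc.d/rc.libvirt stop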

 

Link to comment
1 minute ago, trurl said:

Emulated disks 14 and 15 are unmountable, but SMART looks OK. You should try to repair the filesystem on the emulated disks; if they are then mountable, you can rebuild them.

 

https://wiki.unraid.net/Manual/Storage_Management#Running_the_Test_using_the_webGui

 

That is what I'm looking at now, thanks! Looks like there are issues....

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x439496, xfs_agf block 0x7ffffff9/0x200
agf has bad CRC for ag 1
Metadata CRC error detected at 0x43cfad, xfs_bnobt block 0x80000000/0x1000
btree block 1/1 is suspect, error -74
Metadata CRC error detected at 0x43cfad, xfs_cntbt block 0x80000008/0x1000
btree block 1/2 is suspect, error -74
out-of-order cnt btree record 25 (195207574 483) block 1/2
out-of-order cnt btree record 26 (48588613 610) block 1/2
out-of-order cnt btree record 27 (22270440 653) block 1/2
out-of-order cnt btree record 28 (212725104 734) block 1/2
out-of-order cnt btree record 29 (68889253 763) block 1/2
out-of-order cnt btree record 30 (69842275 976) block 1/2
out-of-order cnt btree record 31 (408138 1065) block 1/2
out-of-order cnt btree record 32 (75921534 1140) block 1/2
out-of-order cnt btree record 33 (11907778 1249) block 1/2
out-of-order cnt btree record 34 (65323505 1402) block 1/2
out-of-order cnt btree record 35 (17823671 1554) block 1/2
out-of-order cnt btree record 36 (13061415 1664) block 1/2
out-of-order cnt btree record 37 (410865 1803) block 1/2
out-of-order cnt btree record 38 (14734129 1971) block 1/2
out-of-order cnt btree record 39 (71076551 2176) block 1/2
out-of-order cnt btree record 40 (55598632 2674) block 1/2
out-of-order cnt btree record 41 (66915612 2723) block 1/2
out-of-order cnt btree record 42 (15412807 2926) block 1/2
out-of-order cnt btree record 43 (34874023 3272) block 1/2
out-of-order cnt btree record 44 (15434037 3499) block 1/2
out-of-order cnt btree record 45 (21234912 3545) block 1/2
out-of-order cnt btree record 46 (96350505 3640) block 1/2
out-of-order cnt btree record 47 (41596365 4210) block 1/2
out-of-order cnt btree record 48 (254222786 4228) block 1/2
out-of-order cnt btree record 49 (56382107 4640) block 1/2
out-of-order cnt btree record 50 (116355071 4842) block 1/2
out-of-order cnt btree record 51 (67565316 5035) block 1/2
out-of-order cnt btree record 52 (196228737 5072) block 1/2
out-of-order cnt btree record 53 (48635751 5179) block 1/2
out-of-order cnt btree record 54 (59835462 5193) block 1/2
out-of-order cnt btree record 55 (63738351 5424) block 1/2
out-of-order cnt btree record 56 (16719899 5864) block 1/2
out-of-order cnt btree record 57 (35530273 6147) block 1/2
out-of-order cnt btree record 58 (268428864 6591) block 1/2
out-of-order cnt btree record 59 (37604543 6854) block 1/2
out-of-order cnt btree record 60 (69367424 6866) block 1/2
out-of-order cnt btree record 61 (64078359 7016) block 1/2
out-of-order cnt btree record 62 (69657714 7222) block 1/2
out-of-order cnt btree record 63 (199624535 7261) block 1/2
out-of-order cnt btree record 64 (16698834 7369) block 1/2
out-of-order cnt btree record 65 (65772445 7645) block 1/2
out-of-order cnt btree record 66 (34775739 7789) block 1/2
out-of-order cnt btree record 67 (18257967 8398) block 1/2
out-of-order cnt btree record 68 (19068571 8580) block 1/2
out-of-order cnt btree record 69 (392018 8644) block 1/2
out-of-order cnt btree record 70 (49514736 8874) block 1/2
out-of-order cnt btree record 71 (98503461 9154) block 1/2
out-of-order cnt btree record 72 (64967088 9377) block 1/2
out-of-order cnt btree record 73 (11450231 9611) block 1/2
out-of-order cnt btree record 74 (266755081 10222) block 1/2
out-of-order cnt btree record 75 (210009117 12488) block 1/2
out-of-order cnt btree record 76 (54254618 12559) block 1/2
out-of-order cnt btree record 77 (66681910 12934) block 1/2
out-of-order cnt btree record 78 (122412185 13163) block 1/2
out-of-order cnt btree record 79 (194016293 15256) block 1/2
out-of-order cnt btree record 80 (66762172 15968) block 1/2
out-of-order cnt btree record 81 (56180134 16629) block 1/2
out-of-order cnt btree record 82 (194582706 16801) block 1/2
out-of-order cnt btree record 83 (226786945 17666) block 1/2
out-of-order cnt btree record 84 (66461810 17865) block 1/2
out-of-order cnt btree record 85 (200768603 17936) block 1/2
out-of-order cnt btree record 86 (186525262 19021) block 1/2
out-of-order cnt btree record 87 (224943003 22652) block 1/2
out-of-order cnt btree record 88 (65862874 23072) block 1/2
out-of-order cnt btree record 89 (64823156 24092) block 1/2
out-of-order cnt btree record 90 (199479370 25987) block 1/2
out-of-order cnt btree record 91 (223888721 33049) block 1/2
out-of-order cnt btree record 92 (225972644 41700) block 1/2
agf_freeblks 27127499, counted 27153867 in ag 1
sb_fdblocks 374735719, counted 374919015
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
data fork in ino 2150484118 claims free block 297875666
data fork in ino 2150770520 claims free block 273513335
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (1,11450231-11459841) only seen by one free space btree
free space (1,19068114-19068569) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 4
        - agno = 5
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

I'm running this from the GUI. I assume I now remove the -n option and run the check again, and it should repair things?
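
For reference, this is roughly the command-line equivalent of the GUI check once -n is removed. A sketch only: the array has to be started in Maintenance mode, and the md device number is assumed to match the disk slot (newer Unraid releases name these /dev/md15p1 instead of /dev/md15):

# repair the emulated filesystem for disk15, verbose output
xfs_repair -v /dev/md15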

 

Link to comment
3 minutes ago, trurl said:

Yes. Post the results and new diagnostics.

 

Depending on how well the repair goes, you might not want to rebuild onto those same disks, so you can keep them as-is in case their contents are better than the emulated repair. Do you have spare disks for the rebuild?

 

Looks to have run OK. Here are the results from Disk 15 after running the repair and then another check:


Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

Also, here are updated diagnostics.
Unfortunately I don't have any extra 6TB drives. I have one 8TB as a spare for the parity disks.

tower-diagnostics-20211107-1409.zip

Link to comment

Both emulated disks mounted and seem to have plenty of data, and it doesn't look like there is anything in lost+found.

 

You can rebuild both at once since you have dual parity.

 

It might be safer to rebuild one at a time to a spare, just to make sure your hardware problems don't come back.

 

You can rebuild to a larger disk, as long as it isn't larger than either parity disk.
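
Before committing to a rebuild, a quick sanity check of the emulated contents can't hurt. A minimal sketch, assuming the array is started and the emulated disks are browsable at /mnt/disk14 and /mnt/disk15:

# used/free space on the emulated disks should look plausible
df -h /mnt/disk14 /mnt/disk15
# count anything xfs_repair moved to lost+found (zero expected here)
ls /mnt/disk15/lost+found 2>/dev/null | wc -l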

Link to comment
