wilsonhomelab Posted March 21, 2022 (edited)

A month ago, I transferred my array (2x Ironwolf 8TB, 8 months old) from desktop hardware to server hardware (dual Xeon E5-2680 v4, 64GB Samsung ECC memory, Supermicro X10DRL-i server motherboard). Unraid recognised all drives in the array (parity, disk 1) and ran smoothly for a couple of weeks. Ten days ago, I purchased an additional Ironwolf 8TB (disk 2) and added it to the array after a preclear. A parity check then completed without any problem at a 220 MB/s average speed.

Last week, disk 1 started showing errors after I transferred some media files from unassigned drives using Krusader (Docker container). I tried swapping the SATA cable and SATA port, but the disk errors only got worse (from 4 errors to 80 errors). I then started a parity check, which only ran at KB/s speeds (I paused it), so I knew something was wrong. I checked the diagnostics and found "ata6: hard resetting link" and "ata6: link is slow to respond, please be patient (ready=0)". unraid-xeon-diagnostics-20220312-1735.zip

Last night, the accumulated disk 1 error count had reached 800 (after a week), so to rule out problems with the motherboard SATA ports and the SATA cables, I installed a known-good LSI SAS2008 HBA with a known-good SFF-8087-to-4x-SATA cable. After booting Unraid, all disks were recognised, but disk 1 and disk 2 were both "unmountable" with an option to format. unraid-xeon-diagnostics-20220321-0240.zip

I then upgraded the motherboard BIOS to the latest version and ran 6 hours of Memtest86+ without error. This morning, I restarted Unraid and saw disk 1 "unmountable"; parity and disk 2 are working. Shares stored on disk 1 are not showing up (not emulated at all). Why is emulation not working with only one disk down? unraid-xeon-diagnostics-20220321-0909.zip

Could you please also advise what I should do now? Repair the XFS file system? Please help~
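For reference, the messages above came out of the syslog in the diagnostics. A quick way to pull them out and map the ata port back to a drive letter is something like the following; this is only a sketch, and it assumes the standard /var/log/syslog location on Unraid and the ata6 port seen in my log, so substitute your own values:

# Count how often each link-reset message appears in the live syslog (path assumed)
grep -E "hard resetting link|link is slow to respond" /var/log/syslog | sort | uniq -c

# Map an ATA port (e.g. ata6) to its block device via the sysfs device paths
ls -l /sys/block | grep ata6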
wilsonhomelab Posted March 21, 2022 (Author)

I removed disk 1 and restarted Unraid. As I expected, the parity drive does not emulate the missing drive. Please help! My priority is recovering the missing data from parity; I hope that is still possible. After that I will test the problem drive, as it is still under warranty.
itimpi Posted March 21, 2022

Actually your screenshot suggests that the drive IS being emulated - it is just flagged as unmountable. Handling of unmountable drives is covered in the online documentation, accessible via the 'Manual' link at the bottom of the GUI.
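If you want to see what a filesystem check would report before anything is written, a read-only pass can be run against the emulated disk from the console. A minimal sketch, assuming the array has been started in maintenance mode and that disk 1 maps to /dev/md1 as it does later in this thread (the device naming can differ between Unraid releases):

# Read-only XFS check of the emulated disk 1; -n reports problems without modifying anything
xfs_repair -n /dev/md1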
wilsonhomelab Posted March 21, 2022 (Author)

I clicked the "Check" button under the disk 1 settings and nothing happened, so I ran the command manually:

root@UNRAID-Xeon:~# xfs_repair /dev/md1
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.
Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.
If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Could you please explain the "ERROR" section? Thanks
itimpi Posted March 21, 2022

That is a standard warning message and can normally be ignored, as mentioned in the online documentation that can be accessed via the Manual link at the bottom of the GUI.
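For completeness, the sequence the error message itself is asking for (mount to replay the journal, unmount, then re-run the repair) would look roughly like this. This is only a sketch with an assumed temporary mount point; starting the array normally attempts the same mount, and the GUI check handles the common case, so this is mainly useful if you want to try replaying the log by hand before resorting to -L:

# Try to mount the emulated disk so XFS can replay its journal (mount point assumed)
mkdir -p /mnt/test
mount /dev/md1 /mnt/test

# If the mount succeeds, unmount and re-run the repair without -L
umount /mnt/test
xfs_repair /dev/md1

# Only if the mount fails should destroying the log with -L be considered
# xfs_repair -L /dev/md1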
wilsonhomelab Posted March 21, 2022 (Author)

I went ahead with the -L option. Now I can see the disk 1 content! 😀

If I understand correctly, parity will start rebuilding the physical disk 1 once I restart the array with disk 1 plugged back in? I also want to make sure whether this issue is an indication of a failing drive. May I perform a preclear of disk 1 first, to see if it can withstand the heavy read/write process? If it passes, I will put it back into the array for the rebuild.

root@UNRAID-Xeon:~# xfs_repair /dev/md1 -L
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
        - 04:55:12: zeroing log - 119233 of 119233 blocks done
        - scan filesystem freespace and inode maps...
        - 04:55:13: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 04:55:13: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 15
        - agno = 30
        - agno = 16
        - agno = 17
        - agno = 1
        - agno = 31
        - agno = 18
        - agno = 2
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 3
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - agno = 28
        - agno = 4
        - agno = 5
        - agno = 29
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 04:55:24: process known inodes and inode discovery - 148992 of 148992 inodes done
        - process newly discovered inodes...
        - 04:55:24: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 04:55:24: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 7
        - agno = 9
        - agno = 11
        - agno = 12
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 13
        - agno = 28
        - agno = 5
        - agno = 1
        - agno = 8
        - agno = 19
        - agno = 17
        - agno = 6
        - agno = 18
        - agno = 20
        - agno = 4
        - agno = 15
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 10
        - agno = 27
        - agno = 16
        - agno = 29
        - agno = 30
        - agno = 14
        - agno = 31
        - 04:55:24: check for inodes claiming duplicate blocks - 148992 of 148992 inodes done
Phase 5 - rebuild AG headers and trees...
        - 04:55:25: rebuild AG headers and trees - 32 of 32 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 04:55:31: verify and correct link counts - 32 of 32 allocation groups done
Maximum metadata LSN (6:909635) is ahead of log (1:2).
Format log to cycle 9.
done
itimpi Posted March 21, 2022

Putting the disk back will cause it to be rebuilt to match the emulated drive.

39 minutes ago, wilsonhomelab said:
May I perform a preclear of disk 1 first, to see if it can withstand the heavy read/write process? If it passes, I will put it back into the array for the rebuild.

That is fine. You could also try running an extended SMART test on the drive before adding it back.
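A hedged sketch of running the extended SMART test from the console, assuming the drive currently shows up as /dev/sdb (check the actual device letter on the Main page first); the same test can also be started from the disk's attributes page in the GUI:

# Start a long (extended) self-test; it runs in the background on the drive itself
smartctl -t long /dev/sdb

# After the estimated runtime has passed, review the self-test result and the SMART attributes
smartctl -a /dev/sdb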
wilsonhomelab Posted March 21, 2022 (Author)

While disk 1 is running its preclear, disk 2 has come up with 5 errors 🤨. Should I repeat the same procedure - start the array in maintenance mode and check and repair the XFS file system again? I think I still have not found the cause of these disk errors. unraid-xeon-diagnostics-20220322-0936.zip
wilsonhomelab Posted March 25, 2022 (Author, marked as Solution)

I think I finally found the culprit. The I/O errors came from, I believe, insufficient SATA power. I had been using a Molex-to-2x-15-pin-SATA cable to power a SilverStone 5-bay hot-swap cage because of a clearance problem, and I never ran into trouble until I populated all 5 bays of a single cage (I have two of the cages). I found a report of a similar issue, which prompted me to power the cage with two dedicated SATA power cables straight from the PSU.

The end result is that disk 1 precleared at double the data rate, 200+ MB/s instead of 90 MB/s, and the rebuild finished in 10 hours at an average of 202 MB/s.

I wish the system log could be more specific about this kind of I/O error ("hard resetting link"), or that the drive itself could report an under-power condition. I hope this will be helpful for someone else.
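If anyone wants to confirm that a fix like this has actually taken hold, one approach is to watch for the reset messages recurring while the drives are under sustained load. A small sketch, with the device name assumed as /dev/sdb:

# Follow the kernel log for new link resets while a preclear or rebuild is running
dmesg -w | grep -E "hard resetting link|link is slow to respond"

# Check the negotiated SATA link speed; a link that has renegotiated down to a lower
# speed after errors is another sign of a marginal connection
smartctl -a /dev/sdb | grep -i "SATA Version"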