HAMANY Posted October 9, 2022

Hi all,

It started with Disk 9 suddenly reporting "UDMA CRC Error Count" errors, after which it was disabled. I changed the SATA cable/port and rebuilt the array. After a few hours Disk 7 was disabled because of read errors, and Disk 8 also had 300k+ read errors but wasn't disabled.

I think there is something wrong with my setup, as I'm getting a lot of UDMA CRC errors on different drives from time to time. I'm starting to lose faith in it.

I have 14 drives connected to:
- Motherboard: 6x SATA ports
- 1x SAS card, 2 ports to 8 SATA
- 1x SATA card with 2 SATA ports

Based on your experience, what is your recommended next action?

Thank you.

tower-diagnostics-20221009-1751.zip
HAMANY Posted October 10, 2022

It seems that Disk 8 is already disabled, but unRaid is showing it as online. I noticed that some files have disappeared from my shares. What is the best way to minimize further data loss?
Solution
JorgeB Posted October 10, 2022

There are issues with 3 disks, so parity cannot emulate them all. Almost simultaneous errors like these are usually the result of power/connection issues. Power down, check/replace the cables on the problem disks, and post new diags after array start.
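To illustrate why parity cannot emulate several failed disks at once, here is a minimal sketch that assumes a simplified model with a single XOR parity. The byte values and the `xor_blocks` helper are hypothetical and for illustration only; a real array works on whole devices and may have a second parity disk, which this sketch does not model.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks" (illustrative values only).
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(disks)

# One missing disk: XOR of parity with the surviving disks recovers it exactly.
rebuilt = xor_blocks([parity, disks[0], disks[2]])
one_missing_ok = rebuilt == disks[1]

# Two missing disks: the same XOR only yields (disk 1 XOR disk 2),
# which is not enough to recover either disk on its own.
combined = xor_blocks([parity, disks[0]])
two_missing_ambiguous = combined == xor_blocks([disks[1], disks[2]])
```

With one parity block per position there is only one equation, so only one unknown disk can be solved for at a time.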
HAMANY Posted October 10, 2022

10 hours ago, JorgeB said:
There are issues with 3 disks, so parity cannot emulate them all, almost simultaneous errors like these are usually the result of power/connection issues, power down, check/replace cables on the problem disks and post new diags after array start.

Here you go. Diags attached.

tower-diagnostics-20221010-2323.zip
JorgeB Posted October 11, 2022

Oct 10 23:22:19 Tower kernel: ata10: reset failed, giving up
Oct 10 23:22:19 Tower kernel: ata10.00: disabled

Disk 9 dropped offline again. Did you replace both cables?
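For anyone who wants to scan a syslog for the same kind of drop-out, here is a minimal sketch. The pattern is based only on the two kernel messages quoted above, so other message formats are not covered, and `find_dropped_ports` is a hypothetical helper name. Mapping an `ataN` port back to a specific array disk is not shown.

```python
import re

# Pattern derived only from the two kernel lines quoted above (assumption:
# other libata message variants are not matched).
ATA_EVENT = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: (reset failed, giving up|disabled)"
)

def find_dropped_ports(log_lines):
    """Return {ata port: [events]} for ports that failed reset or were disabled."""
    events = {}
    for line in log_lines:
        match = ATA_EVENT.search(line)
        if match:
            events.setdefault(match.group(1), []).append(match.group(2))
    return events

sample = [
    "Oct 10 23:22:19 Tower kernel: ata10: reset failed, giving up",
    "Oct 10 23:22:19 Tower kernel: ata10.00: disabled",
]
dropped = find_dropped_ports(sample)
```

A port that shows both events, as `ata10` does here, lost its link and was given up on by the kernel, which fits a cable or power problem rather than a filesystem one.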
HAMANY Posted October 11, 2022

5 minutes ago, JorgeB said:
Oct 10 23:22:19 Tower kernel: ata10: reset failed, giving up
Oct 10 23:22:19 Tower kernel: ata10.00: disabled
Disk9 dropped offline again, did you replace both cables?

Yes, I did. Let me replace all the SATA power cables and try again. I'm using a StarTech SATA power splitter.

Thank you for your response. If you have any other suggestions for the SATA and power cables, please let me know.
JorgeB Posted October 11, 2022

21 minutes ago, HAMANY said:
I'm using StarTech SATA power splitter

Those are not good; you should not have more than two drives on a SATA splitter.
HAMANY Posted October 15, 2022

On 10/11/2022 at 12:09 PM, JorgeB said:
Oct 10 23:22:19 Tower kernel: ata10: reset failed, giving up
Oct 10 23:22:19 Tower kernel: ata10.00: disabled
Disk9 dropped offline again, did you replace both cables?

I replaced all the splitter cables, checked all the connections, and restarted the server. I removed all SATA splitters and now only use 2 Molex to 2x SATA adapters.

Disks 7 and 9 look like they are disabled and are shown as "Unmountable disks". Is there any way to rebuild them using parity?

Do you think it's wise to mount the drives on Windows and copy all the files from these 2 disks before rebuilding?

I appreciate your advice. Thank you.

tower-diagnostics-20221015-2337.zip
itimpi Posted October 15, 2022

Parity cannot fix a disk showing as unmountable. The correct handling of unmountable disks is covered here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page.
HAMANY Posted October 15, 2022

22 minutes ago, itimpi said:
Parity cannot fix a disk showing as unmountable. The correct handling of unmountable disks is covered here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page.

What parameters should I use for "xfs_repair"? I checked both disks using "-n" and received the output below.

Please note that one of the disks was rebuilding before it got disabled again and became unmountable. Would this have corrupted my files?

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.

tower-diagnostics-20221016-0042.zip
itimpi Posted October 16, 2022

Are you running the repair from the GUI or the command line? If the latter, what was the exact command you used? I am checking because many people get the command slightly wrong when using the command line.

To actually do a repair you remove the -n option, and if it subsequently asks for it you add the -L option.
HAMANY Posted October 16, 2022

3 hours ago, itimpi said:
Are you running the repair from the GU/I or the command line? If the latter what was the exact command you used? I am checking as many people get the command slightly wrong using the command line. to actually do a repair you remove the -n option, and if it subsequently asks for it you add the -L option

I'm running it from the GUI. I removed the -n and got the outputs below. Should I proceed with the "-L"?

Disk 7

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Disk 9

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
JorgeB Posted October 16, 2022

3 hours ago, HAMANY said:
Should I proceed with the "-L" ?

Yes, it's the only option since the disks cannot be mounted to clear the log.
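The sequence followed in this exchange (check with -n, rerun without -n, add -L only when xfs_repair asks for it and the filesystem cannot be mounted) can be sketched roughly as below. This is not an Unraid or xfsprogs tool: `next_xfs_repair_flags` is a hypothetical helper, it does not run xfs_repair itself, and the marker strings are copied from the output quoted in this thread, so other messages are not handled.

```python
def next_xfs_repair_flags(current_flags, output, can_mount):
    """Suggest flags for the next xfs_repair run, or None if no rerun is indicated.

    Marker strings come from the output quoted in this thread (assumption:
    other xfs_repair messages are not recognised).
    """
    # Collapse line wraps so the markers match however the output was wrapped.
    text = " ".join(output.split())
    if "-n" in current_flags and "Cannot proceed further in no_modify mode" in text:
        # Check-only mode stopped early: rerun without -n to allow changes.
        return [flag for flag in current_flags if flag != "-n"]
    if "the -L option to destroy the log" in text:
        if can_mount:
            # Preferred path: mount to replay the log, unmount, rerun unchanged.
            return list(current_flags)
        # Last resort: destroying the log may lose recent metadata changes.
        return list(current_flags) + ["-L"]
    return None
```

An empty list means "rerun with no extra flags", while None means the quoted output gives no reason to rerun. In this thread the disks could not be mounted, which is why -L was the remaining choice.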
HAMANY Posted October 16, 2022

31 minutes ago, JorgeB said:
Yes, it's the only option since the disks cannot be mounted to clear the log.

Thank you for your help.

Done for both disks with no errors. The last 2 lines are:

Format log to cycle 4.
done

I stopped the array and then started it in normal mode, as stated in the documentation. The 2 drives still have a red x next to the disk name and the status is "disabled".

The entire contents of the drives were moved to the "lost+found" share.

What should I do next?

tower-diagnostics-20221016-1244.zip
JorgeB Posted October 16, 2022

7 minutes ago, HAMANY said:
The entire drives contents were moved to the "lost+found" share.

In that case it might be better to rebuild parity instead. Post current diags first.
itimpi Posted October 16, 2022

At this point the repair has been run against the ‘emulated’ drive and the physical disabled drive is untouched.

If you look at the contents of the Lost+Found folder, do you think you can sort the contents out or not? Entries being put there means the repair process could not locate their directory entry to give them the correct name. If the contents look like too much work to resolve, what is the state of your backups?

Keep the physical ‘disabled’ drives intact for now, as depending on your answers we may recommend different ways forward.
HAMANY Posted October 16, 2022

7 minutes ago, JorgeB said:
In that case it might be better to rebuild parity instead, post current diags first.

I posted the latest diags in the previous reply after starting the array in normal mode.

7 minutes ago, itimpi said:
At this point the repair has been run against the ‘emulated’ drive and the physical disabled drive is untouched. If you look at the contents of the Lost+Found folder do you think you can sort the contents out or not? Entries being put there means the repair process could not locate their directory entry to give them the correct name. If the contents look like too much work to resolve what is the state of your backups? keep the physical ‘disabled’ drives intact for now as depending on your answers we may recommend different ways forward.

The Lost+Found folder contains folders and files named with random numbers. The folders contain my files with their original names and the correct file extensions ✔️. The files are just renamed to numbers with no extension ❌.

I don't have a 1:1 backup of all the files, as most of them are available online. I have backups of some personal folders.

I will do the following; please let me know what you think.
- I will copy the entire Lost+Found folder to an external drive (around 11.5TB)
- I will rebuild parity after your confirmation
- If there are any corrupted/missing files, I will restore them from the backups or re-download them.

I appreciate your advice.
JorgeB Posted October 16, 2022

34 minutes ago, HAMANY said:
I posted the latest diags in the previous reply after starting the array in normal mode.

Yes, sorry. Both disks look healthy. With the array stopped, unassign both, start the array, then stop the array. Then see if both unassigned disks mount with the UD (Unassigned Devices) plugin; if they do, check that the contents look OK.
HAMANY Posted October 16, 2022

47 minutes ago, JorgeB said:
Yes, sorry, both disks look healthy, with the array stopped, unassign both, start the array, stop the array, then see if both unassigned disks mount with the UD plugin, if yes check that contents look OK.

Thanks. Should I copy my data before doing these steps?
JorgeB Posted October 16, 2022

No need, that won't change anything for now.
HAMANY Posted October 16, 2022

1 hour ago, JorgeB said:
No need, that won't change anything for now.

One of them (sdm) mounts fine and I can see my data in the same structure.

The other one (sde) does not mount. I get the output below when I click on "file system check". Diags attached.

There is a button called "Run with correct flag". Should I try it?

FS: xfs
Executing file system check: /sbin/xfs_repair -n /dev/sde1 2>&1
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.
File system corruption detected!

tower-diagnostics-20221016-1555.zip
JorgeB Posted October 17, 2022

19 hours ago, HAMANY said:
There is a button called "Run with correct flag", should I try it?

Yes.
HAMANY Posted October 17, 2022

13 hours ago, JorgeB said:
Yes.

I'm almost done copying my files.

As for assigning the 2 disks back to the array and starting the rebuild, should I add both disks together, or one by one (finish rebuilding one, then add the second)?

Thanks
trurl Posted October 17, 2022

If you rebuild the disks you will get exactly what is shown with the emulated disks, with all that lost+found.

On 10/16/2022 at 5:53 AM, JorgeB said:
In that case it might be better to rebuild parity instead

If you rebuild parity instead, the drives will have their current contents as seen when you mount them with Unassigned Devices.
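The difference between the two options can be sketched with a simplified single XOR parity model. This is an illustration under stated assumptions, not how the array is implemented: the byte values and the `xor_blocks` helper are hypothetical, and a second parity disk is not modelled.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 2-byte blocks (illustrative values only).
healthy = [b"\x01\x02", b"\x10\x20"]   # disks that stayed in the array
physical = b"\xaa\xbb"                 # what is actually on the disabled disk
emulated = b"\xa0\xb0"                 # emulated contents after the repair changed them

# Parity tracks the emulated contents, because writes to a disabled disk
# (including a filesystem repair) only update parity.
parity = xor_blocks(healthy + [emulated])

# Option A, rebuild the disk: the disk is overwritten with the emulated view.
disk_after_disk_rebuild = xor_blocks(healthy + [parity])

# Option B, rebuild parity: the disk keeps its contents, parity is recomputed.
parity_after_parity_rebuild = xor_blocks(healthy + [physical])
disk_after_parity_rebuild = physical
```

In this model option A reproduces the emulated view, lost+found included, while option B keeps whatever the physical disks hold when mounted outside the array.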
HAMANY Posted October 17, 2022

2 minutes ago, trurl said:
If you rebuild the disks you will get exactly what is shown with the emulated disks, with all that lost+found. If you rebuild parity instead, the drives will have their current contents as seen when you mount them with Unassigned Devices.

How can I choose between rebuilding the disks and rebuilding parity? Rebuilding parity is more suitable for me.