August 22, 2024
My setup is a tower case with 5 case/CPU fans, a Supermicro X10SL7-F motherboard, 32GB of Crucial/Micron ECC RAM, a Xeon 1285v3 CPU, and 4 iStarUSA 5 in 3 cages for a possible total of 20 HDDs (17 slots are filled). Two SATA SSDs run cache_ssd, and all my dockers and processing run on an NVMe drive installed in an add-on card. It currently has a 650W Seasonic PSU, but it's 10 years old.
I've been chasing this problem for months now. Drive 15 is the second-youngest drive in the system, just over a year old. It periodically reports errors, but every SMART test I run comes back clean. I've replaced the breakout cables for the HBA card, I've replaced the HBA card, I've replaced the CPU, and I even tried a new drive in different slots and got the same errors. I installed an additional fan blowing directly on the heatsink of the HBA card. I ordered a new power supply that will be here tomorrow; that's the last thing I can think of that may be the issue.
I've tried drive 15 in two different 5 in 3 enclosures and in every free slot, with the same results. Sometimes, after I rebuild the data from parity, it finishes and is okay for a week or two, then reports errors again. Sometimes the rebuild fails: it reports errors, disables the drive, and the drive then shows up under Unassigned Devices.
I am currently trying to rebuild the drive again and was waiting to see if it would error out again, but I noticed some weird things in the syslog, so I decided to go ahead and post. Could all of this be caused by a failing PSU? I can usually solve issues by referencing others' posts and held off on posting, but I need some help.
mrpunrfsx1-syslog-20240822-1943.zip
Edited August 22, 2024 by prongATO: Added updated syslog with disk errors
August 22, 2024 Community Expert
There aren't any disk errors in the syslog posted. Also, please always post the complete diags instead.
August 22, 2024 Author
Yes, I know that; I stated that in my message. I saw some weird errors at the end of the syslog that I don't understand. The disk errors will eventually happen again: the drive is being rebuilt, and it fails with errors 80% of the time. I was going to wait until I saw another error and the drive was disabled, but I thought someone might be able to glean something from the errors that are already there.
August 22, 2024 Community Expert
If you mean the inotify call trace, that's unrelated to any disk issues.
August 22, 2024 Author
1 hour ago, JorgeB said: "If you mean the inotify call trace, that's unrelated to any disk issues."
Here's the updated syslog. It had errors, just like I figured.
mrpunrfsx1-syslog-20240822-1943.zip
August 23, 2024 Community Expert
14 hours ago, JorgeB said: "please always post the complete diags instead."
August 23, 2024 Author
13 hours ago, JorgeB said:
My apologies. I rolled back Unraid to version 6.12.8 and was trying to rebuild the drive again, but it just encountered more errors on disk 15. Full diagnostics are attached. Is it odd that the number of errors is always 1024?
The new Seasonic Prime Gold 850W PS has been installed and I'm attempting to rebuild drive 15 again.
mrpunrfsx1-diagnostics-20240823-1359.zip
Edited August 23, 2024 by prongATO: Added a question - updated PS
August 23, 2024 Author
No joy on the new PS. I downgraded to 6.12.8 because I thought that might make a difference, but nope. Thus far, I've tried replacing the following to solve this issue:
- Power supply (Seasonic 660W to Seasonic 850W)
- CPU (Xeon 1275v3 to Xeon 1285v3)
- HBA card (LSI 9207 to 9211)
- HBA breakout cables
- The HDD (whatever drive is installed as 15 always ends up with write errors)
- Installed a fan blowing directly on the HBA card heatsink
At this point, I have no ideas left to try.
mrpunrfsx1-diagnostics-20240823-1710.zip
August 24, 2024
On 8/23/2024 at 12:47 AM, prongATO said: "Drive 15 periodically reports errors but every SMART test run, comes back clean."
In general, a SMART test won't clear the error, and according to the log the last three extended tests were aborted or interrupted by the host.
SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description  Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Aborted by host           90%        8722             -
# 2  Extended offline  Interrupted (host reset)  90%        8722             -
# 3  Short offline     Completed without error   00%        8699             -
# 4  Short offline     Completed without error   00%        8697             -
# 5  Short offline     Completed without error   00%        8697             -
# 6  Extended offline  Aborted by host           90%        8626             -
# 7  Short offline     Completed without error   00%        8626             -
# 8  Extended offline  Completed without error   00%        7752             -
# 9  Short offline     Completed without error   00%        7737             -
#10  Short offline     Completed without error   00%        1289             -
#11  Extended offline  Completed without error   00%        1017             -
#12  Short offline     Completed without error   00%        1006             -
According to the log, the disk has also recorded only one error.
Error 1 [0] occurred at disk power-on lifetime: 8449 hours (352 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 a9 a9 87 98 40 00 Error: UNC at LBA = 0xa9a98798 = 2846459800
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 01 00 00 00 00 00 a9 a9 87 90 40 00 04:03:47.620 READ FPDMA QUEUED
60 01 00 00 00 00 00 a9 a9 86 90 40 00 04:03:47.619 READ FPDMA QUEUED
60 01 00 00 00 00 00 a9 a9 85 90 40 00 04:03:47.618 READ FPDMA QUEUED
60 01 00 00 00 00 00 a9 a9 84 90 40 00 04:03:47.616 READ FPDMA QUEUED
60 01 00 00 00 00 00 a9 a9 83 90 40 00 04:03:47.615 READ FPDMA QUEUED
On 8/23/2024 at 12:47 AM, prongATO said: "I am currently trying to rebuild the drive again and was waiting to see if it would error out again"
Assuming the rebuild has completed, what was the result? If the same disk keeps failing periodically and is still under warranty, an RMA is not a bad idea.
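For anyone reading along, the LBA in that error entry can be checked by hand. This is a minimal Python sketch, assuming the six bytes under LBA_48 and LH LM LL are printed most-significant byte first (which agrees with the value the log itself prints) and assuming 512-byte logical sectors, which the post does not state. The function name is made up for illustration.

```python
def lba_from_register_bytes(hex_bytes: str) -> int:
    """Combine space-separated hex bytes (most significant first) into one integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), byteorder="big")


# Bytes copied from the error entry: "00 00 a9" under LBA_48, "a9 87 98" under LH LM LL.
error_lba = lba_from_register_bytes("00 00 a9 a9 87 98")

# Start LBA of the most recent READ FPDMA QUEUED command listed before the error.
last_read_start = lba_from_register_bytes("00 00 a9 a9 87 90")

# ASSUMPTION: 512-byte logical sectors; check the sector size the drive reports.
SECTOR_SIZE = 512

print(f"error LBA: {error_lba} (0x{error_lba:x})")
print(f"sectors past the start of the last read: {error_lba - last_read_start}")
print(f"approximate byte offset on the disk: {error_lba * SECTOR_SIZE}")
```

The decoded value matches the "UNC at LBA" figure in the log, and the error lands a few sectors into the last read that was queued, which fits a single unreadable sector recorded at power-on hour 8449 rather than a string of separate failures.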
August 24, 2024 Community Expert
10 hours ago, prongATO said: "Downgraded to 6.12.8 because I thought that might make a difference but nope."
The Unraid release should not make any difference with disk errors. As Vr2Io mentioned, it's been a while since a long SMART test completed, so it would be good to run one.
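The "it's been a while" point can be read straight off the self-test log posted earlier. Below is a rough Python sketch that parses those rows and measures the gap between the newest entry and the last extended test that completed. It uses the newest LifeTime value in the log as a stand-in for the drive's current power-on hours, which is an assumption; the names `SelfTest`, `parse_selftests` and `hours_since_last_completed_extended` are invented for this example.

```python
import re
from typing import List, NamedTuple, Optional

# Rows copied from the self-test log posted earlier in the thread.
SELFTEST_LOG = """\
# 1 Extended offline Aborted by host 90% 8722 -
# 2 Extended offline Interrupted (host reset) 90% 8722 -
# 3 Short offline Completed without error 00% 8699 -
# 4 Short offline Completed without error 00% 8697 -
# 5 Short offline Completed without error 00% 8697 -
# 6 Extended offline Aborted by host 90% 8626 -
# 7 Short offline Completed without error 00% 8626 -
# 8 Extended offline Completed without error 00% 7752 -
# 9 Short offline Completed without error 00% 7737 -
#10 Short offline Completed without error 00% 1289 -
#11 Extended offline Completed without error 00% 1017 -
#12 Short offline Completed without error 00% 1006 -
"""

ROW = re.compile(
    r"^#\s*(\d+)\s+(Extended|Short) offline\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$"
)


class SelfTest(NamedTuple):
    num: int
    kind: str            # "Extended" or "Short"
    status: str
    remaining_pct: int
    lifetime_hours: int


def parse_selftests(text: str) -> List[SelfTest]:
    """Turn the log rows into records; lines that do not match are skipped."""
    tests = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            tests.append(SelfTest(int(m[1]), m[2], m[3], int(m[4]), int(m[5])))
    return tests


def hours_since_last_completed_extended(tests: List[SelfTest]) -> Optional[int]:
    """Power-on hours between the newest log entry and the last completed long test."""
    completed = [
        t.lifetime_hours
        for t in tests
        if t.kind == "Extended" and t.status.startswith("Completed")
    ]
    if not completed:
        return None
    newest = max(t.lifetime_hours for t in tests)
    return newest - max(completed)


tests = parse_selftests(SELFTEST_LOG)
gap = hours_since_last_completed_extended(tests)
print(f"{len(tests)} entries, {gap} power-on hours since a long test last completed")
```

With the rows above, the gap is 8722 - 7752 = 970 power-on hours, and every extended test started in that window was aborted or interrupted before finishing.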
August 24, 2024 Author
10 hours ago, Vr2Io said: "In general, a SMART test won't clear the error, and according to the log the last three extended tests were aborted or interrupted by the host. [...] Assuming the rebuild has completed, what was the result? If the same disk keeps failing periodically and is still under warranty, an RMA is not a bad idea."
I just finished an extended offline test, no errors.
mrpunrfsx1-diagnostics-20240824-0955.zip
Edited August 24, 2024 by prongATO: Added reboot diagnostics
August 24, 2024 Author
I also tested my RAM with the memory test plugin. I left the array stopped and tested 2 loops of 30GB, finding no errors.
August 24, 2024 Author
Figured I'd give it another shot, same result. Sometimes it rebuilds the data, only to error out a few days later. Sometimes it starts a rebuild and doesn't finish, and sometimes it starts a rebuild and errors out immediately.
mrpunrfsx1-diagnostics-20240824-1144.zip
Edited August 24, 2024 by prongATO: Fixed a typo
August 24, 2024
4 hours ago, prongATO said: "I also tested my RAM with the memory test plugin."
Could you also run the built-in memory test? I have found that test no. 7 detects errors effectively, so you could run only that one to save time.
In this case, although it looks like a disk problem, the log recorded on the disk doesn't match that at all. As mentioned, you can:
1. RMA that disk and check whether the problem goes away, or
2. Swap another disk with it and check whether the problem follows the same physical disk or stays with disk 15.
Edited August 24, 2024 by Vr2Io
August 24, 2024 Author
3 hours ago, Vr2Io said: "Could you also run the built-in memory test? I have found that test no. 7 detects errors effectively, so you could run only that one to save time. In this case, although it looks like a disk problem, the log recorded on the disk doesn't match that at all. As mentioned, you can: 1. RMA that disk and check whether the problem goes away, or 2. Swap another disk with it and check whether the problem follows the same physical disk or stays with disk 15."
Unfortunately, I've already tried a new disk. I had ordered another 6TB WD Red Plus and it went through a full preclear successfully (WD60EFPX-68C5ZN0_WD-WX22D24N1K9L). Before I used it to upgrade disk 11 from 3TB to 6TB, I tried it as a replacement for disk 15 and it exhibited the same behavior. I'm completely at a loss for what the heck is going on. I have 4 other disks attached to the same HBA, 3 HDDs and 1 SSD, and none of them seem to have any issues.
Since this disk is part of my TV Series share, I've been thinking that when I have the money I could get 2 new 6TB disks, replace a 3TB and a 4TB disk with them, move all the data from disk 15 onto those disks, and then shrink the array, removing disk 15. I've replaced just about everything else I can think of and nothing has worked.
Edited August 24, 2024 by prongATO: to explain a possible solution with more clarity
August 25, 2024 Author
Okay, I re-cabled the entire system. Since I replaced the Seasonic with another Seasonic, the ports for the various cables matched, so I hadn't replaced all the power cabling. I went ahead and did that today, paying special attention to the power cables to the 5 in 3 cages. After doing that, I tried to rebuild the drive and it had errors again.
I've tried multiple drives in two different cages and 4 different slots in the system (I have four 5 in 3 cages in the case), so I decided to take the cages out of the equation. With the new power supply I have an extra SATA power port (6x instead of 5x), so I've hooked the drive up directly to the HBA card on its own power cable and am attempting to rebuild the data in a configuration that removes the cages from the path. I'll let you know if it makes any difference and post the diagnostics if it doesn't.
August 25, 2024
4 hours ago, prongATO said: "replacement for disk 15 and it exhibited the same behavior."
That's good: it means the problem is not related to the disk itself, because the disk had no problems as disk 11.
Since the problem sticks to disk 15, could you try leaving slot 15 empty and temporarily assigning the disk to another slot, e.g. 17? This would show whether something affects disk 15 only. Since you use single parity, reassigning a data disk does no harm.
Edited August 25, 2024 by Vr2Io
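The reasoning in this post is a swap test: change one factor at a time and see which one the errors follow. A small Python sketch of that bookkeeping is below. The drive labels `drive_A` and `drive_B` are placeholders (the thread only gives the serial of the newer disk), the three observations are a simplified reading of what was reported, and `error_rate_by` is a made-up helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (drive label, array slot, saw errors?) -- simplified from the thread so far.
Observation = Tuple[str, str, bool]

observations: List[Observation] = [
    ("drive_A", "disk15", True),   # original drive in slot 15: errors
    ("drive_B", "disk15", True),   # new 6TB drive tried as disk 15: errors
    ("drive_B", "disk11", False),  # same new drive as disk 11: no errors
]


def error_rate_by(obs: List[Observation], field: int) -> Dict[str, float]:
    """Fraction of observations with errors, grouped by drive (0) or slot (1)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for record in obs:
        key = record[field]
        counts[key][0] += int(record[2])
        counts[key][1] += 1
    return {key: errors / total for key, (errors, total) in counts.items()}


by_drive = error_rate_by(observations, 0)
by_slot = error_rate_by(observations, 1)
print("by drive:", by_drive)
print("by slot: ", by_slot)
```

Three data points prove nothing on their own, but the pattern is the one described: the errors track the slot rather than either drive, which is why moving the assignment to a different slot number is the next factor worth isolating.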
August 25, 2024 Author
3 minutes ago, Vr2Io said: "That's good: it means the problem is not related to the disk itself. Since the problem sticks to disk 15, could you try leaving slot 15 empty and temporarily assigning the disk to another slot, e.g. 17? This would show whether something affects disk 15 only. Since you use single parity, reassigning a data disk does no harm."
I'm going to wait and see if my hunch is right. I have a couple of extra 5 in 3 cages on hand, so if it finishes rebuilding disk 15 while the drive is running on the desk next to the server, I'll swap out the cage and see if the errors pop up again with a different 5 in 3.
August 25, 2024
2 minutes ago, prongATO said: "I'm going to wait and see if my hunch is right. I have a couple of extra 5 in 3 cages on-hand, so if it finishes rebuilding disk 15 while it's running on the desk next to the server, I'll swap out the cage and see if the errors pop up again with a different 5 in 3."
That's fine, keep going until the trouble is solved. 💪💪
August 25, 2024 Author
Well, with the drive outside the case but connected, disk 15 finished rebuilding. Then disk 16 reported errors and was disabled. It's in a completely different cage (cage 1) and is connected to the on-motherboard HBA. As disk 15 finished rebuilding, I had to reboot the server because I couldn't reach the web interface. Disk 16 is ZFS, is only in the array for backups of my ZFS drives (cache_ssd and cache_nvme), and is only 3TB.
I'm still a bit confused as to what is causing all these errors. Since disk 16 is in a completely different 5 in 3 and isn't on the same HBA, I'm even more at a loss than I was before. I should have tried to get the diagnostics from the command line, but my IPMI wasn't working and the web interface wouldn't load after the errors on disk 16 showed up. I was able to SSH to the server and perform a graceful shutdown, though.
Edited August 25, 2024 by prongATO: additional information
August 25, 2024
20 hours ago, prongATO said: "I decided to take those out of the equation. With the new power supply I have an extra SATA power port (6x instead of 5x) so I've hooked the drive up directly to the HBA card and on its own power cable and attempting to rebuild the data in a configuration that remov [...]"
Since another disk has now failed, you can't avoid making the big change of removing all the cages.
Although the PSU has 6 sockets, does it come bundled with 6 cables too? What is the SATA/Molex configuration of each cable? Also, how long are the HBA-to-SATA cables, and are they all the same length?
Edited August 25, 2024 by Vr2Io
August 26, 2024 Author
6 hours ago, Vr2Io said: "Since another disk has now failed, you can't avoid making the big change of removing all the cages. Although the PSU has 6 sockets, does it come bundled with 6 cables too? What is the SATA/Molex configuration of each cable? Also, how long are the HBA-to-SATA cables, and are they all the same length?"
I'm running 4 cables, one for each cage, and each cable has two Molex connectors for its cage. I have a 5th going to the SATA SSDs. I over-specified the wattage for the new power supply when I went with 850W; the previous model was 660W.
I can't really remove all the cages because I wouldn't have enough room for the 17 3.5" HDDs I currently have in the system. I think I'm going to get a 10 or 12TB drive when I have the funds and replace the parity first. Then I'll start replacing the 3 and 4TB drives with 10s or 12s and reduce the number of drives in my system.
As soon as disk 16 finishes rebuilding and I have a healthy array, I'm going to replace the 5 in 3 that was having the issues. The mind-boggling thing is that my parity drive is ALSO in the same cage, so again, none of this makes any sense.
See the attached images for the layout of my drives in the cages. The NVMe SSD is in an add-on card, and the SATA SSDs are attached to a card that sits in an open PCI slot but isn't "plugged in", so I didn't have to use up spots in the 5 in 3s.
- Green: drive under warranty
- Blue: drive out of warranty
- Dark Green: Parity
- Orange: SSD/NVMe
Edited August 26, 2024 by prongATO: explain layout
August 26, 2024
34 minutes ago, prongATO said: "I'm running 4 cables one for each cage and each one has two Molex connectors per cage."
Please make sure each cage is fed by two different cables, i.e. two cables shared between two cages. One cable per cage (feeding both Molex connectors) is not an ideal topology.
Edited August 26, 2024 by Vr2Io
August 26, 2024 Author
14 hours ago, Vr2Io said: "Please make sure each cage is fed by two different cables, i.e. two cables shared between two cages. One cable per cage (feeding both Molex connectors) is not an ideal topology."
I'll re-cable the PS when I install the other cage today. I previously had SATA cables and Molex for each cage.
The array is good, for now. Disk 16 finished rebuilding with no issues and the extended offline test was good. In all my years working in IT, intermittent issues have always been the hardest to figure out. I don't have enough experience with Linux/Unix to even begin to track these down; I have a rudimentary knowledge, at best.
August 27, 2024 Author
Okay, this is getting annoying. This is the first time my array has been healthy in weeks, and now the drives refuse to spin down. As soon as I spin them down, something (emhttpd) reads the SMART data and spins them back up. I disabled "Enable spinup groups", but some odd things are coming up in the log.
mrpunrfsx1-diagnostics-20240826-1957.zip
Edited August 27, 2024 by prongATO: added diagnostics