unraid v6.12.11 odd errors - LBA w/drive errors - General Support

August 22, 2024

My setup is in a tower case with 5 case/CPU fans, a Supermicro X10SL7-F motherboard, 32GB of Crucial/Micron ECC RAM, a Xeon 1285v3 CPU, and 4 iStarUSA 5 in 3s for a possible total of 20 HDDs (17 slots are filled). I have two SATA SSDs running cache_ssd, and all my dockers and processing run on an NVMe drive installed in an add-on card in a slot. It currently has a 650W Seasonic PSU, but it’s 10 years old.

I’ve been chasing this problem for months now. Drive 15 is the second-youngest drive in the system; it’s just over a year old. Drive 15 periodically reports errors, but every SMART test I run comes back clean. I’ve replaced the breakout cables for the HBA card, I’ve replaced the HBA card, I’ve replaced the CPU, and I even tried a new drive in different slots and got the same errors. I installed an additional fan blowing directly on the heatsink of the HBA card. I ordered a new power supply that will be here tomorrow; that’s the last thing I can think of that may be the issue.

Drive 15 is the drive that keeps reporting errors. I’ve tried it in two different 5 in 3 enclosures and in every free slot, with the same results. Sometimes, after I rebuild the data from parity, the rebuild finishes and the drive is okay for a week or two, then it reports errors again. Sometimes the rebuild fails: the drive reports errors, gets disabled, and then shows up under unassigned devices.

I am currently trying to rebuild the drive again and was waiting to see if it would error out again, but I noticed some weird things in the syslog, so I decided to go ahead and make a post. Could all of this be because of a failing PSU? I am usually able to solve any issues by referencing others' posts and have held off on posting this, but I need some help.

mrpunrfsx1-syslog-20240822-1943.zip

Edited August 22, 2024 by prongATO
Added updated syslog with disk errors

August 22, 2024

Community Expert

There aren't any disk errors in the syslog posted. Also, please always post the complete diags instead.

August 22, 2024

Author

Yes, I know; I stated that in my message. I saw some weird errors at the end of the syslog that I don’t understand, and it will eventually happen again. The drive is being rebuilt, and it fails with errors 80% of the time. I was going to wait until I saw another error and the drive was disabled, but I thought someone might be able to glean something from the errors that are already there.

August 22, 2024

Community Expert

If you mean the inotify call trace, that's unrelated to any disk issues.

August 22, 2024

Author

1 hour ago, JorgeB said:

If you mean the inotify call trace, that's unrelated to any disk issues.

Here’s the updated syslog. It had errors, just like I figured.

mrpunrfsx1-syslog-20240822-1943.zip

August 23, 2024

Community Expert

14 hours ago, JorgeB said:

please always post the complete diags instead.

August 23, 2024

Author

13 hours ago, JorgeB said:

My apologies. I rolled back the Unraid version to 6.12.8 and was trying to rebuild the drive again, but it just encountered more errors on disk 15. Full diagnostics are attached. Is it odd that the number of errors is always 1024?

The new Seasonic Prime Gold 850W PS has been installed and I'm attempting to rebuild drive 15 again.

mrpunrfsx1-diagnostics-20240823-1359.zip

Edited August 23, 2024 by prongATO
Added a question - updated PS

August 23, 2024

Author

No joy on the new PS. Downgraded to 6.12.8 because I thought that might make a difference but nope.

Thus far, I've tried replacing the following to solve this issue:

Power supply (Seasonic 660W to Seasonic 850W)

CPU (Xeon 1275v3 to Xeon 1285v3)

HBA card (LSI 9207 to 9211)

HBA breakout cables

The HDD (whatever drive is installed as disk 15 always ends up with write errors)

Installed a fan blowing directly on the HBA card heatsink

At this point, I have no ideas left to try.

mrpunrfsx1-diagnostics-20240823-1710.zip

August 24, 2024

On 8/23/2024 at 12:47 AM, prongATO said:

Drive 15 periodically reports errors, but every SMART test I run comes back clean.

In general, a SMART test won't clear the error, and according to the log the last three extended tests did not complete: two were aborted by the host and one was interrupted by a host reset.

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%      8722         -
# 2  Extended offline    Interrupted (host reset)      90%      8722         -
# 3  Short offline       Completed without error       00%      8699         -
# 4  Short offline       Completed without error       00%      8697         -
# 5  Short offline       Completed without error       00%      8697         -
# 6  Extended offline    Aborted by host               90%      8626         -
# 7  Short offline       Completed without error       00%      8626         -
# 8  Extended offline    Completed without error       00%      7752         -
# 9  Short offline       Completed without error       00%      7737         -
#10  Short offline       Completed without error       00%      1289         -
#11  Extended offline    Completed without error       00%      1017         -
#12  Short offline       Completed without error       00%      1006         -
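For anyone who wants to check a self-test log like this without reading it row by row, here is a minimal Python sketch. It only parses the rows pasted above; the regular expression and the names parse_selftests and summarize_extended are my own illustration and are not part of smartctl or Unraid.

```python
import re

# Rows copied from the self-test log quoted above (most recent test first).
SELFTEST_LOG = """\
# 1  Extended offline    Aborted by host               90%      8722         -
# 2  Extended offline    Interrupted (host reset)      90%      8722         -
# 3  Short offline       Completed without error       00%      8699         -
# 4  Short offline       Completed without error       00%      8697         -
# 5  Short offline       Completed without error       00%      8697         -
# 6  Extended offline    Aborted by host               90%      8626         -
# 7  Short offline       Completed without error       00%      8626         -
# 8  Extended offline    Completed without error       00%      7752         -
# 9  Short offline       Completed without error       00%      7737         -
#10  Short offline       Completed without error       00%      1289         -
#11  Extended offline    Completed without error       00%      1017         -
#12  Short offline       Completed without error       00%      1006         -
"""

# Hypothetical pattern written for this layout only.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<kind>Extended offline|Short offline)\s+"
    r"(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\S+)\s*$"
)


def parse_selftests(text):
    """Return one dict per recognised self-test row, in log order."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "num": int(m["num"]),
                "kind": m["kind"],
                "status": m["status"],
                "hours": int(m["hours"]),
            })
    return rows


def summarize_extended(rows):
    """Split the extended tests into completed and unfinished ones."""
    extended = [r for r in rows if r["kind"] == "Extended offline"]
    completed = [r for r in extended if r["status"].startswith("Completed")]
    unfinished = [r for r in extended if not r["status"].startswith("Completed")]
    return completed, unfinished


rows = parse_selftests(SELFTEST_LOG)
completed, unfinished = summarize_extended(rows)
last_ok = max(r["hours"] for r in completed)   # power-on hours of last good extended test
newest = max(r["hours"] for r in rows)         # power-on hours of the newest test
print("unfinished extended tests:", [r["num"] for r in unfinished])
print("hours since last completed extended test:", newest - last_ok)
```

The point is simply that tests 1, 2 and 6 never finished, so the last extended test that actually ran to completion is several hundred power-on hours old.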

According to the log, the disk has also recorded only one error.

Error 1 [0] occurred at disk power-on lifetime: 8449 hours (352 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 a9 a9 87 98 40 00  Error: UNC at LBA = 0xa9a98798 = 2846459800

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 00 00 00 a9 a9 87 90 40 00     04:03:47.620  READ FPDMA QUEUED
  60 01 00 00 00 00 00 a9 a9 86 90 40 00     04:03:47.619  READ FPDMA QUEUED
  60 01 00 00 00 00 00 a9 a9 85 90 40 00     04:03:47.618  READ FPDMA QUEUED
  60 01 00 00 00 00 00 a9 a9 84 90 40 00     04:03:47.616  READ FPDMA QUEUED
  60 01 00 00 00 00 00 a9 a9 83 90 40 00     04:03:47.615  READ FPDMA QUEUED
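As a cross-check of the numbers in this error entry, here is a small Python sketch that rebuilds the LBA from the register bytes and converts the power-on hours. The field positions reflect how I read the column header (six LBA bytes, most significant first), and lba_from_bytes is a made-up helper name, so treat it as an illustration rather than a reference decoder.

```python
def lba_from_bytes(hex_bytes):
    """Combine LBA register bytes (most significant first) into one integer."""
    value = 0
    for b in hex_bytes:
        value = (value << 8) | int(b, 16)
    return value


# The thirteen register fields from the "Error: UNC" row above.
error_row = "40 -- 51 00 00 00 00 a9 a9 87 98 40 00".split()
# ASSUMPTION: fields 5..10 are the six LBA bytes (LBA_48 bytes, then LH LM LL).
error_lba = lba_from_bytes(error_row[5:11])

# The most recent READ FPDMA QUEUED command listed before the error.
read_row = "60 01 00 00 00 00 00 a9 a9 87 90 40 00".split()
read_lba = lba_from_bytes(read_row[5:11])

# Power-on lifetime reported for the error, split into days and hours.
days, hours = divmod(8449, 24)

print(hex(error_lba), error_lba)
print("sectors past the start of the last read:", error_lba - read_lba)
print(f"{days} days + {hours} hours")
```

The decoded value agrees with the LBA printed in the log, and the error sits a few sectors into the last queued read, which is what you would expect from a single unreadable spot rather than a string of failures.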

On 8/23/2024 at 12:47 AM, prongATO said:

I am currently trying to rebuild the drive again and was waiting to see if it would error out again

Assuming it has completed, what was the result?

If the same disk keeps failing periodically and is still under warranty, an RMA would not be a bad idea.

August 24, 2024

Community Expert

10 hours ago, prongATO said:

Downgraded to 6.12.8 because I thought that might make a difference but nope.

Unraid release should not make any difference with disk errors.

As Vr2Io mentioned, it's been a while since a long SMART test completed, so it would be good to run one.

August 24, 2024

Author

10 hours ago, Vr2Io said:

In general, a SMART test won't clear the error, and according to the log the last three extended tests did not complete: two were aborted by the host and one was interrupted by a host reset.

Assuming it has completed, what was the result?

If the same disk keeps failing periodically and is still under warranty, an RMA would not be a bad idea.

I just finished an extended offline test, no errors.

mrpunrfsx1-diagnostics-20240824-0955.zip

Edited August 24, 2024 by prongATO
Added reboot diagnostics

August 24, 2024

Author

I also tested my RAM with the memory test plugin. I left the array stopped and tested 2 loops of 30GB, finding no errors.

August 24, 2024

Author

Figured I’d give it another shot, same result. Sometimes it rebuilds the data, only to error out a few days later. Sometimes it starts a rebuild and doesn’t finish, sometimes it starts a rebuild and errors out immediately.

mrpunrfsx1-diagnostics-20240824-1144.zip

Edited August 24, 2024 by prongATO
Fixed a typo

August 24, 2024

4 hours ago, prongATO said:

I also tested my RAM with the memory test plugin.

Could you also run the built-in memory test? I've found that test no. 7 detects errors effectively, so you could run only that one to save time.

In this case, although it looks like a disk problem, the error log recorded on the disk doesn't match that at all.

As mentioned, you can:

1. RMA that disk and check whether the problem goes away.

Or

2. Swap another disk in for it and check whether the problem follows the same physical disk or stays with disk 15.
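The reasoning behind the swap test can be written down as a small Python sketch. Everything here is hypothetical: the function name classify_fault, the disk labels, and the example observations are made up for illustration and are not taken from the diagnostics.

```python
from collections import defaultdict


def classify_fault(observations):
    """Decide whether errors follow a physical disk or an array slot.

    observations: iterable of (disk_id, slot, had_errors) tuples.
    """
    slots_by_disk = defaultdict(dict)   # disk -> {slot: had_errors}
    disks_by_slot = defaultdict(dict)   # slot -> {disk: had_errors}
    for disk, slot, had_errors in observations:
        slots_by_disk[disk][slot] = had_errors
        disks_by_slot[slot][disk] = had_errors

    # A disk is suspect only if it failed in every slot it was tried in,
    # and it was tried in at least two slots.
    bad_disks = [d for d, r in slots_by_disk.items()
                 if len(r) >= 2 and all(r.values())]
    # A slot is suspect only if every disk tried in it failed,
    # and at least two disks were tried.
    bad_slots = [s for s, r in disks_by_slot.items()
                 if len(r) >= 2 and all(r.values())]

    if bad_disks and not bad_slots:
        return "follows disk"
    if bad_slots and not bad_disks:
        return "follows slot"
    return "inconclusive"


# Hypothetical example: two different disks fail as disk 15,
# and one of them is fine when assigned elsewhere.
example = [
    ("disk-A", 15, True),
    ("disk-B", 15, True),
    ("disk-B", 11, False),
]
print(classify_fault(example))
```

If the result is "follows disk", an RMA makes sense. If it is "follows slot", the disk is probably fine and the search moves to whatever is specific to that position.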

Edited August 24, 2024 by Vr2Io

August 24, 2024

Author

3 hours ago, Vr2Io said:

Could you also run the built-in memory test? I've found that test no. 7 detects errors effectively, so you could run only that one to save time.

In this case, although it looks like a disk problem, the error log recorded on the disk doesn't match that at all.

As mentioned, you can:

1. RMA that disk and check whether the problem goes away.

Or

2. Swap another disk in for it and check whether the problem follows the same physical disk or stays with disk 15.

Unfortunately, I've already tried a new disk. I had ordered another 6TB WD Red Plus and it went through a full preclear successfully (WD60EFPX-68C5ZN0_WD-WX22D24N1K9L). Before I used it to upgrade disk 11 from 3TB to 6TB, I tried it as a replacement for disk 15 and it exhibited the same behavior. I'm completely at a loss as to what is going on. I have 4 other disks attached to the same HBA (3 HDDs and 1 SSD) and none of them seem to have any issues. Since this disk is part of my TV Series share, I've been thinking that when I have the money I could get 2 new 6TB disks to replace a 3TB and a 4TB disk, move all the data from disk 15 to those disks, and then shrink the array, removing disk 15. I've replaced just about everything else I can think of and nothing has worked.

Edited August 24, 2024 by prongATO
to explain a possible solution with more clarity

August 25, 2024

Author

Okay, I re-cabled the entire system. Since I replaced the Seasonic with another Seasonic, the ports for the various cables matched, so I hadn't replaced all the power cabling. I went ahead and did that today, paying special attention to the power cables to the 5 in 3 cages. After doing that, I tried to rebuild the drive, and it had errors again. I've tried multiple drives in two different cages and 4 different slots in the system (I have four 5 in 3 cages in the case), so I decided to take the cages out of the equation. With the new power supply I have an extra SATA power port (6x instead of 5x), so I've hooked the drive up directly to the HBA card on its own power cable, and I'm attempting to rebuild the data in a configuration that removes the cages from the path. I'll let you know if it makes any difference and post the diagnostics if it doesn't.

August 25, 2024

4 hours ago, prongATO said:

replacement for disk 15 and it exhibited the same behavior.

That's good; it means the problem isn't the disk itself, since the same disk had no problems as disk 11.

Since the problem sticks to disk 15, could you try leaving slot 15 empty and temporarily assigning the disk to another slot, e.g. 17?

This is to rule out something that affects only disk 15. Since you use single parity, it does no harm to reassign a data disk.

Edited August 25, 2024 by Vr2Io

August 25, 2024

Author

3 minutes ago, Vr2Io said:

That's good; it means the problem isn't the disk itself.

Since the problem sticks to disk 15, could you try leaving slot 15 empty and temporarily assigning the disk to another slot, e.g. 17?

This is to test whether something affects only disk 15. Since you use single parity, it does no harm to reassign a data disk.

I'm going to wait and see if my hunch is right. I have a couple of extra 5 in 3 cages on-hand, so if it finishes rebuilding disk 15 while it's running on the desk next to the server, I'll swap out the cage and see if the errors pop up again with a different 5 in 3.

August 25, 2024

2 minutes ago, prongATO said:

I'm going to wait and see if my hunch is right. I have a couple of extra 5 in 3 cages on-hand, so if it finishes rebuilding disk 15 while it's running on the desk next to the server, I'll swap out the cage and see if the errors pop up again with a different 5 in 3.

That's fine, keep going until the problem is solved. 💪💪

August 25, 2024

Author

Well, with the drive outside the case but connected, disk 15 finished rebuilding. Then disk 16 reported errors and has been disabled. It's in a completely different cage (cage 1) and is connected to the onboard HBA. As disk 15 finished rebuilding, I had to reboot the server because I couldn't reach the web interface.

Disk 16 is ZFS and is only in the array for backups of my ZFS drives (cache_ssd and cache_nvme) and is only 3TB. I'm still a bit confused as to what is causing all these errors. Since disk 16 is in a completely different 5 in 3 and isn't on the same HBA, I'm even more at a loss than I was before. I should have tried to get the diagnostics from the command line, but my IPMI wasn't working and the web interface wouldn't load after the errors on disk 16 showed up. I was able to SSH to the server and perform a graceful shutdown, though.

Edited August 25, 2024 by prongATO
additional information

August 25, 2024

20 hours ago, prongATO said:

I decided to take those out of the equation. With the new power supply I have an extra SATA power port (6x instead of 5x) so I've hooked the drive up directly to the HBA card and on its own power cable and attempting to rebuild the data in a configuration that remov

Since another disk has now failed, you can't avoid the big change of removing all the cages.

The PSU has 6 sockets, but did it come with 6 cables too? What is the SATA / Molex configuration of each cable?

Also, how long are the HBA-to-SATA cables, and are they all the same?

Edited August 25, 2024 by Vr2Io

August 26, 2024

Author

6 hours ago, Vr2Io said:

Since another disk has now failed, you can't avoid the big change of removing all the cages.

The PSU has 6 sockets, but did it come with 6 cables too? What is the SATA / Molex configuration of each cable?

Also, how long are the HBA-to-SATA cables, and are they all the same?

I'm running 4 cables one for each cage and each one has two Molex connectors per cage. I have a 5th going to the SATA SSDs. I over-specified the wattage for the new power supply when I went with 850W. The previous model was 660W.

I can't really remove all the cages because I wouldn't have enough room for the 17 3.5" HDD I currently have in the system. I think I'm going to get a 10 or 12TB drive when I have the funds and replace the parity first. Then I'll start replacing the 3 and 4TB drives with 10s or 12s and reduce the number of drives in my system. As soon as disk 16 finishes rebuilding and I have a healthy array, I'm going to replace the 5 in 3 that was having the issues. The mind boggling thing is that my parity drive is ALSO in the same cage, so again, none of this makes any sense.

See the attached images for the layout of my drives in the cages. The NVMe SSD is in an add-on card, and the SATA SSDs are attached to a card that sits in an open PCI slot but isn't "plugged in", so I didn't have to use up spots in the 5 in 3s.

Green: drive under warranty

Blue: drive out of warranty

Dark Green: Parity

Orange: SSD/NVMe

Edited August 26, 2024 by prongATO
explain layout

August 26, 2024

34 minutes ago, prongATO said:

I'm running 4 cables one for each cage and each one has two Molex connectors per cage.

Please make sure each cage is fed by two different cables, so two cables serve two cages. One cable per cage (both Molex connectors on the same cable) is not an ideal topology.

Edited August 26, 2024 by Vr2Io

August 26, 2024

Author

14 hours ago, Vr2Io said:

Please make sure each cage is fed by two different cables, so two cables serve two cages. One cable per cage (both Molex connectors on the same cable) is not an ideal topology.

I'll re-cable the PS when I install the other cage today. I previously had SATA cables and Molex for each cage.

The array is good, for now. Disk 16 finished rebuilding with no issues and the extended offline test was good. In all my years working in IT, intermittent issues have always been the hardest to figure out. I don't have enough experience with Linux/Unix to even begin to track these down; I have a rudimentary knowledge, at best.

August 27, 2024

Author

Okay, this is getting annoying. This is the first time my array has been healthy in weeks, and the drives refuse to spin down. As soon as I spin them down, something (emhttpd) reads the SMART data and spins them back up.

I disabled the enable spin up groups setting, but some odd things are coming up in the log.
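In case it helps anyone reading along, here is a rough Python sketch of how the syslog could be scanned for drives that get a SMART read shortly after being spun down. It assumes emhttpd writes lines of the form "spinning down /dev/sdX" and "read SMART /dev/sdX"; the sample lines, the hostname, the device names, the year, and the name quick_wakeups are all made up for illustration and are not taken from the attached diagnostics.

```python
import re
from datetime import datetime

# ASSUMPTION: emhttpd log lines look like these samples. The hostname,
# devices, and timestamps are invented for illustration only.
SAMPLE = """\
Aug 26 19:50:01 tower emhttpd: spinning down /dev/sdf
Aug 26 19:50:07 tower emhttpd: read SMART /dev/sdf
Aug 26 19:55:00 tower emhttpd: spinning down /dev/sdg
Aug 26 21:10:00 tower emhttpd: read SMART /dev/sdg
"""

LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+emhttpd:\s+"
    r"(?P<event>spinning down|read SMART)\s+(?P<dev>/dev/\S+)"
)


def quick_wakeups(lines, window_seconds=60, year=2024):
    """Return (device, delay_seconds) for SMART reads soon after a spin-down.

    Syslog timestamps carry no year, so one is supplied by the caller.
    """
    last_down = {}   # device -> time of the most recent spin-down
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        dev = m["dev"]
        if m["event"] == "spinning down":
            last_down[dev] = ts
        elif dev in last_down:
            delay = (ts - last_down.pop(dev)).total_seconds()
            if 0 <= delay <= window_seconds:
                hits.append((dev, delay))
    return hits


print(quick_wakeups(SAMPLE.splitlines()))
```

A device that shows up here over and over is being woken almost immediately, which points at something polling it rather than at normal file access.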

mrpunrfsx1-diagnostics-20240826-1957.zip

Edited August 27, 2024 by prongATO
added diagnostics

Solved by Vr2Io
