My setup is in a tower case with 5 case/CPU fans, a Supermicro X10SL7-F motherboard, 32GB Crucial/Micron ECC RAM, a Xeon 1285v3 CPU, and 4 iStarUSA 5 in 3s for a possible total of 20 HDDs (17 slots are filled). I have two SATA SSDs running cache_ssd, and all my dockers and processing run on an NVMe drive installed in an add-on card in a slot. It currently has a 650W Seasonic PSU, but it’s 10 years old.

I’ve been chasing this problem for months now. Drive 15 is my second-youngest drive in the system; it’s just over a year old. It periodically reports errors, but every SMART test I run comes back clean. I’ve replaced the breakout cables for the HBA card, I’ve replaced the HBA card, I’ve replaced the CPU, and I even tried a new drive in different slots and got the same errors. I installed an additional fan blowing directly on the heatsink of the HBA card. I ordered a new power supply that will be here tomorrow; that’s the last thing I can think of that may be the issue.

I’ve tried drive 15 in two different 5 in 3 enclosures and in every free slot, with the same results. Sometimes, after I rebuild the data from parity, the rebuild finishes and the drive is okay for a week or two, then it reports errors again. Sometimes the rebuild fails: the drive reports errors, gets disabled, and then shows up under unassigned devices.

I am currently trying to rebuild the drive again and was waiting to see if it would error out again, but I noticed some weird things in the syslog, so I decided to go ahead and make a post. Could all of this be because of a failing PSU? I can usually solve any issue by referencing others’ posts and have held off on posting this, but I need some help.

mrpunrfsx1-syslog-20240822-1943.zip

To be more certain, you could run one more parity check. If the result is positive, we can assume the problem is caused by a plugin and then work out which one. I would suggest uninstalling disklocation first.

unraid v6.12.11 odd errors - LBA w/drive errors - Page 3 - General Support

September 1, 2024

12 minutes ago, prongATO said:

First, man, your drives get toasty! The highest I’ve ever seen any drive in my system is 38c.

Yes, bad cooling and room temp 30c 😅

September 1, 2024

Author

image.png.4dfb90fa7c72f145c851a345307ff0d8.png

September 2, 2024

Author

On 9/1/2024 at 1:17 AM, Vr2Io said:
Checking the different logs, there is no clear reason for the disk drops; they just happen at random times. When a disk drops from the array, you get the entries below in the syslog: the disk reconnects within 3 seconds and changes drive letter. This is the normal result of a disk drop.
Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
Aug 24 11:42:51 MRPUNRFSX1 kernel: sd 1:0:1:0: device_unblock and setting to running, handle(0x0009)

Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Attached scsi generic sg8 type 0
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Power-on or device reset occurred
Aug 24 11:42:57 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL ()' is not set to auto mount.
-----

Aug 23 13:57:35 MRPUNRFSX1 kernel: sd 3:0:4:0: device_block, handle(0x000d)
Aug 23 13:57:37 MRPUNRFSX1 kernel: sd 3:0:4:0: device_unblock and setting to running, handle(0x000d)

Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Attached scsi generic sg11 type 0
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Power-on or device reset occurred
Aug 23 13:57:43 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (dev1)' is not set to auto mount.
-----

Aug 23 16:57:20 MRPUNRFSX1 kernel: sd 4:0:3:0: device_block, handle(0x000c)
Aug 23 16:57:22 MRPUNRFSX1 kernel: sd 4:0:3:0: device_unblock and setting to running, handle(0x000c)

Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Attached scsi generic sg10 type 0
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Power-on or device reset occurred
Aug 23 16:57:30 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (sdu)' is not set to auto mount.
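One rough way to see which SCSI addresses drop most often is to count the device_block entries per address. The sketch below assumes the log lines have the same shape as the excerpts above; the function name count_device_blocks is made up for illustration.

```shell
# Count "device_block" events per SCSI address in syslog-style text on stdin.
# Prints one "<address> <count>" line per address, sorted by address.
# Assumes lines shaped like the excerpts above:
#   "... kernel: sd H:C:T:L: device_block, handle(...)"
count_device_blocks() {
    grep -E 'kernel: sd [0-9:]+: device_block,' \
        | sed -E 's/.*kernel: sd ([0-9:]+): device_block,.*/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example: count_device_blocks < /var/log/syslog
```

If the counts cluster on addresses that belong to one controller or one cage, that would point at that part of the chain; an even spread would suggest something shared, such as power or the board.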
(1) Please confirm that the dropped disk shows up in UD and that you can mount it and read the files; if so, you don't actually need to rebuild it. Once the problem is fixed, re-add it to the array and rebuild parity.

(2) Disable automatic disk spin-down.

(3) Open a terminal and run "dmesg -wT" to watch for any "I/O error" messages,

or

"tail -f /var/log/syslog" to watch for device_block and device_unblock entries.

You can press "Enter" at any time to add blank lines, which makes it easy to spot any new log entries after that point.

(4) A parity operation should generate maximum I/O and is usually a good way to stress a disk into dropping out, but that doesn't seem to be the case here. So I suggest a different test in the hope of reproducing the problem quickly. I don't know what will actually trigger it, so just try it.

Please open 3 terminals and run one of the commands below in each. Two concurrent hashes of big files should put enough load on the CPU; the third one just hashes small files.

find /mnt/disk5 -type f -size +8G -exec b3sum {} \;

Hashes files on disk5 that are larger than 8GB.

find /mnt/disk6 -type f -size +8G -exec b3sum {} \;

Hashes files on disk6 that are larger than 8GB.

find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 b3sum

Hashes files on the user shares (in random order) that are smaller than 5k.
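If juggling three terminals is awkward, the same three jobs can be started from a single shell. This is only a sketch of that idea, not part of the original suggestion: the function names, the HASHER variable and the /tmp log paths are assumptions for illustration, while the find, sort, xargs and b3sum invocations are the ones given above.

```shell
# Hasher to use; defaults to b3sum as in the commands above.
# HASHER is a hypothetical override added for illustration.
HASHER=${HASHER:-b3sum}

# Hash every regular file under directory $1 whose size matches $2
# (find -size syntax, e.g. +8G or -5k).
hash_by_size() {
    find "$1" -type f -size "$2" -exec "$HASHER" {} \;
}

# Run the three jobs concurrently and wait for all of them to finish.
# The /tmp log paths are placeholders.
run_stress() {
    hash_by_size /mnt/disk5 +8G > /tmp/hash_disk5.log 2>&1 &
    hash_by_size /mnt/disk6 +8G > /tmp/hash_disk6.log 2>&1 &
    find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 "$HASHER" \
        > /tmp/hash_small.log 2>&1 &
    wait
}
```

Calling run_stress starts all three jobs; keeping "dmesg -wT" or the syslog open in another terminal while it runs shows whether a drop occurs.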

While the memtest was running, I removed the 3rd 5 in 3 and thoroughly cleaned it out. I ordered a new fan to put in it and will likely do that for all 4 cages if it works well. After I reinstalled the 5 in 3, it rebuilt disk 12 with no issue. I still have the server in a state where I can service the cages. I will stress test it on Wednesday to see if we can make it error again. Right now, the parity is valid and the array is green.

Diagnostics from the latest boot attached. (no disk errors, so far)

mrpunrfsx1-diagnostics-20240902-1825.zip

September 2, 2024

Author

@Vr2Io do you want me to start another parity check to see if we can force another error? I might be able to pull full diagnostics if one comes up.

September 2, 2024

2 minutes ago, prongATO said:

@Vr2Io do you want me to start another parity check to see if we can force another error? I might be able to pull full diagnostics if one comes up.

Yes, it's fine to start another parity check. It's currently the only known method to trigger a disk drop, until you find a quicker way.

Edited September 2, 2024 by Vr2Io

September 3, 2024

Author

Just noticed this on a reboot

September 3, 2024

5 minutes ago, prongATO said:

Just noticed this on a reboot

I've never seen this type of error. It's likely HBA or motherboard related, because disks won't cause this. How many disks are detected when booting into Unraid?

Edited September 3, 2024 by Vr2Io

September 3, 2024

Author

5 minutes ago, Vr2Io said:

I've never seen this type of error. It's likely HBA or motherboard related, because disks won't cause this. How many disks are detected when booting into Unraid?

I have all 8 ports populated on the HBA card. There is also an HBA built into the motherboard. All ports are full at the moment.

On a reboot the error is cleared

Edited September 3, 2024 by prongATO

September 3, 2024

27 minutes ago, prongATO said:

I have all 8 ports populated on the HBA card. There is also an HBA built into the motherboard. All ports are full at the moment.

Just noticed you have an onboard ~~(2008)~~ (2308) and an add-on HBA ~~(2308)~~ (2008).

We need to check whether disks drop on only one HBA or across both.

Edited September 3, 2024 by Vr2Io

September 3, 2024

Author

6 minutes ago, Vr2Io said:

Just noticed you have an onboard (2008) and an add-on HBA (2308).

We need to check whether disks drop only on the onboard HBA.

I thought that was the issue at first. Overall, most of the errors have been on the HBA card, which is why I already replaced it, but disk 16 was connected to the motherboard when it got its error.

September 3, 2024

Since you have more than 8 disks, you need both controllers, so we can't isolate one controller for testing.

I'm wondering whether a faulty controller could affect the good one, but I can't confirm this.

During a normal boot, does the LSI BIOS show both controllers in the same list?

The error message says the add-on HBA (2008, 01:00.0) failed, but the onboard one (2308, 02:00.0) is fine.

What other add-on HBAs do you have on hand?

Edited September 3, 2024 by Vr2Io

September 3, 2024

Author

24 minutes ago, Vr2Io said:

Since you have more than 8 disks, you need both controllers, so we can't isolate one controller for testing.

I'm wondering whether a faulty controller could affect the good one, but I can't confirm this.

During a normal boot, does the LSI BIOS show both controllers in the same list?

The error message says the add-on HBA (2008, 01:00.0) failed, but the onboard one (2308, 02:00.0) is fine.

Yes, there are two discrete controllers: the 9211 and the 2308. I'm flashing them with updated firmware now.

September 3, 2024

2 minutes ago, prongATO said:

Yes, there are two discrete controllers: the 9211 and the 2308. I'm flashing them with updated firmware now.

It would be best to crossflash the add-on 9211 from IR (2008 IR) to IT (2008 IT) firmware for the best combination.

Edited September 3, 2024 by Vr2Io

September 3, 2024

Author

3 minutes ago, Vr2Io said:

It would be best to crossflash the add-on 9211 from IR (2008 IR) to IT (2008 IT) firmware for the best combination.

They're both flashed to IT mode with the most stable firmware, version 20.00.07.00, and the BIOS has been updated.

image.png.156813de3c94143393b58a0f44d5a5d2.png

Edited September 3, 2024 by prongATO

September 3, 2024

Author

Diagnostics from the latest boot, for reference.

mrpunrfsx1-diagnostics-20240902-2010.zip

September 3, 2024

Since no problem shows up after booting into Unraid, you can ignore the BIOS error and start the parity check.

If you reboot again, does the BIOS always show the same error?

What other add-on HBAs do you have on hand?

September 3, 2024

27 minutes ago, prongATO said:

They're both flashed to IT mode with the most stable firmware, version 20.00.07.00, and the BIOS has been updated.

I don't have any SAS2008 in my system, but yours shows up as a RAID bus controller, which looks like IR firmware. Anyway, this doesn't matter.

Below is what my previous SAS2008 HBA looked like.

Edited September 3, 2024 by Vr2Io

September 3, 2024

Author

6 minutes ago, Vr2Io said:

I don't have any SAS2008 in my system, but yours shows up as a RAID bus controller, which looks like IR firmware. Anyway, this doesn't matter.

Below is what my previous SAS2008 HBA looked like.

September 3, 2024

Author

Fingers crossed, I started the parity check last night and it's 75% done with no errors. Surely this all didn't happen because of a dirty 5 in 3 cage?! Yeah, it was a bit dusty, and I took it COMPLETELY apart to clean it (complete teardown, clean and reassemble). Maybe the firmware flashes and BIOS updates for the HBAs helped also.

Diagnostics attached for reference

mrpunrfsx1-diagnostics-20240903-1457.zip

Edited September 3, 2024 by prongATO

September 4, 2024

Author

I'm VERY encouraged by today's results; parity check finished in just over 13 hours with no errors. I removed the 5 in 3 cage that was connected to the most errors, completely cleaned the whole device out, checked all the solder joints, replaced the fan with a Noctua fan that can now run 24/7 to keep the drives cooler and reassembled the unit.

Again, I'm not sure whether it was dust in the SATA ports on the backplane of the 5 in 3, flashing both HBAs (onboard and PCIe) with the 20.00.07.00 firmware and updating the BIOS, or the work I did on the cages (I took cage 4 apart and cleaned it earlier too), but everything seems to be running fine now. No more random disk errors in the bottom two 5 in 3 cages. I'm going to do the fan modification on the other cages in the coming weeks and clean all of them thoroughly too.

I'll give it a week or two and, if the system continues to run well, I'll close this as solved and reference cleaning the cages and updating the firmware on the cards as the solution. Thank you to everyone who has helped me with this process.

Latest diagnostics attached for reference.

mrpunrfsx1-diagnostics-20240903-2114.zip

September 4, 2024

Author

Well, unfortunately I woke up to more errors today. Still disk 12.

mrpunrfsx1-diagnostics-20240904-0835.zip

September 4, 2024

Author

On 9/2/2024 at 7:18 PM, Vr2Io said:

Just noticed you have an onboard ~~(2008)~~ (2308) and an add-on HBA ~~(2308)~~ (2008).

We need to check whether disks drop on only one HBA or across both.

Okay, taking your advice here. After the latest errors on disk 12, I moved its data connection from the HBA card to a SATA port on the motherboard. Performing a rebuild now.

None of this makes any sense. For two days it was hooked up to the HBA card and completed a full parity check with no problems. Then, early this morning, it had errors, apparently during the mover script.

September 4, 2024

That's why I said you need to find a way to trigger the disk drop quickly. Otherwise you just fall into an endless rebuild loop.

Edited September 4, 2024 by Vr2Io

September 4, 2024

Author

6 hours ago, Vr2Io said:

That's why I said you need to find a way to trigger the disk drop quickly. Otherwise you just fall into an endless rebuild loop.

I can't figure out what triggers the errors. I honestly thought it might have been the dirty backplane of the HDD cage. The server hadn't made it completely through a parity check in months until the other day. When this problem started, a drive would report errors, I'd rebuild it, and it'd be fine for a few months. Then it got to a few weeks; now it's a few days. I went through my normal hardware troubleshooting and replaced the breakout cables, then the HBA card, then the processor, then the power supply and cables. Really the only thing left is the motherboard. I have a bid on a used Supermicro X10SRL-F and will see in a few days if I can get it.

Edited September 4, 2024 by prongATO

September 4, 2024

How about the results of the test I suggested earlier? There has been no feedback on it.

Currently we can't confirm whether the problem comes from hardware; it could be software related, so we need different tests or setups to verify.

Edited September 5, 2024 by Vr2Io


Solved by Vr2Io
