unraid v6.12.11 odd errors - LBA w/drive errors

12 minutes ago, prongATO said:

First, man, your drives get toasty!  The highest I’ve ever seen any drive in my system is 38c.

Yes, bad cooling and room temp 30c 😅

  • Author

image.png.4dfb90fa7c72f145c851a345307ff0d8.png

  • Author
On 9/1/2024 at 1:17 AM, Vr2Io said:

Checking the different logs, there is no clear reason for the disk drops; they happen at random times. When a disk drops from the array, you will see the lines below in the syslog: the disk reconnects within 3 seconds and gets a new drive letter. This is the normal result of a disk drop.

 

Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
Aug 24 11:42:51 MRPUNRFSX1 kernel: sd 1:0:1:0: device_unblock and setting to running, handle(0x0009)

Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Attached scsi generic sg8 type 0
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Power-on or device reset occurred
Aug 24 11:42:57 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL ()' is not set to auto mount.
-----

Aug 23 13:57:35 MRPUNRFSX1 kernel: sd 3:0:4:0: device_block, handle(0x000d)
Aug 23 13:57:37 MRPUNRFSX1 kernel: sd 3:0:4:0: device_unblock and setting to running, handle(0x000d)

Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Attached scsi generic sg11 type 0
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Power-on or device reset occurred
Aug 23 13:57:43 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (dev1)' is not set to auto mount.
-----

Aug 23 16:57:20 MRPUNRFSX1 kernel: sd 4:0:3:0: device_block, handle(0x000c)
Aug 23 16:57:22 MRPUNRFSX1 kernel: sd 4:0:3:0: device_unblock and setting to running, handle(0x000c)

Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Attached scsi generic sg10 type 0
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Power-on or device reset occurred
Aug 23 16:57:30 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (sdu)' is not set to auto mount.

 

 

(1) Please check that the dropped disk shows up under Unassigned Devices (UD). If you can mount it and read its files, you do not actually need to rebuild it. Once the problem is fixed, re-add it to the array and rebuild parity.

(2) Disable automatic disk spin-down.

 

(3) Open a terminal and run "dmesg -wT" to watch for any "I/O error" messages,

or

"tail -f /var/log/syslog" to watch for any device_block and device_unblock events.

You can press "Enter" at any time to insert a blank line, which makes new log entries easier to spot.

 

(4) A parity operation should generate maximum I/O and is normally useful for stressing disks until one drops out, but that does not seem to be the case here. So I suggest a different test in the hope that it reproduces the problem sooner. I do not know what triggers the drops, so just try it.

Please open three terminals and run one of the commands below in each. Hashing big files on two disks concurrently should put enough load on the CPU; the third command hashes small files.

 

find /mnt/disk5 -type f -size +8G -exec b3sum {} \;

Hashes the files on disk5 that are larger than 8 GB.

 

find /mnt/disk6 -type f -size +8G -exec b3sum {} \;

Hashes the files on disk6 that are larger than 8 GB.

 

find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 b3sum

Hashes the files on the user shares that are smaller than 5k, in random order.
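
After a test run, the device_block events mentioned in step (3) can be counted per SCSI address instead of scrolling through the whole log. This is only a sketch: summarize_drops is a made-up helper name, not an Unraid command, and it assumes the syslog lines have the same layout as the excerpts quoted above.

```shell
#!/bin/sh
# Count device_block events per SCSI address in a syslog-style file.
# Assumption: lines look like the excerpts above, e.g.
#   Aug 24 11:42:49 HOST kernel: sd 1:0:1:0: device_block, handle(0x0009)
# summarize_drops is a hypothetical helper name, not an Unraid command.
summarize_drops() {
    log="${1:-/var/log/syslog}"   # log file to scan
    grep -E 'kernel: sd [0-9:]+: device_block,' "$log" |
        awk '{
            addr = $7; sub(/:$/, "", addr)      # "1:0:1:0:" -> "1:0:1:0"
            count[addr]++
            last[addr] = $1 " " $2 " " $3       # timestamp of the latest event
        }
        END {
            for (a in count)
                printf "%s drops=%d last=%s\n", a, count[a], last[a]
        }' |
        sort
}
```

Run it as `summarize_drops /var/log/syslog`. The first number of each address is the SCSI host, so if the drops cluster on one host number that may hint at one controller or cage, while drops spread across hosts would suggest looking elsewhere.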

 

image.png.61cf2a39e69f5b9f0a8d8921bb7f3d87.png

 

image.png.07c0d5e190678fba8185f39f9d84c5e1.png

 

 

 

While the memtest was running I removed the 3rd 5 in 3 and thoroughly cleaned it out. I ordered a new fan to put in it and will likely do that for all 4 cages if it works well. After I reinstalled the 5 in 3, it rebuilt disk 12 with no issue. I still have the server in a state where I can service the cages. On Wednesday I will stress test it to see if we can make it error again. Right now, the parity is valid and the array is green.

 

Diagnostics from the latest boot attached. (no disk errors, so far)

 

mrpunrfsx1-diagnostics-20240902-1825.zip

  • Author

@Vr2Io  do you want me to start another parity check to see if we can force another error?  I might be able to pull full diagnostics if one comes up.

2 minutes ago, prongATO said:

@Vr2Io  do you want me to start another parity check to see if we can force another error?  I might be able to pull full diagnostics if one comes up.

Yes, it is fine to start another parity check. That is currently the only known way to trigger a disk drop, until you find a quicker one.

Edited by Vr2Io

  • Author

Just noticed this on a reboot

 

 

LBA_error.jpg

5 minutes ago, prongATO said:

Just noticed this on a reboot

 

 

LBA_error.jpg

I have never seen this type of error. It is likely HBA or motherboard related, because disks would not cause this. How many disks are detected when you boot into Unraid?

Edited by Vr2Io

  • Author
5 minutes ago, Vr2Io said:

I have never seen this type of error. It is likely HBA or motherboard related, because disks would not cause this. How many disks are detected when you boot into Unraid?

I have all 8 ports populated on the HBA card. There is also an HBA built into the MB. All ports are full at the moment.

 

On a reboot the error is cleared

Edited by prongATO

27 minutes ago, prongATO said:

I have all 8 ports populated on the HBA card. There is also an HBA built into the MB. All ports are full at the moment.

Just noticed you have an onboard HBA (2308) and an add-on HBA (2008).

We need to check whether disks drop on only one HBA or across both.

Edited by Vr2Io

  • Author
6 minutes ago, Vr2Io said:

Just notice you have onboard (2008) and addon HBA (2308).

Need check does disk drop only at onboard HBA.

I thought that was the issue at first. Overall, I've had most of the errors on the HBA card, which is why I already replaced it, but the disk 16 error happened while that disk was connected to the MB.

Since you have more than 8 disks, you need both controllers, so you can't isolate one controller for testing.

I am wondering whether a faulty controller could affect the good one, but I can't confirm this.

During a normal boot, does the LSI BIOS show both controllers in the same list?

The error message says the add-on HBA (2008, 01:00.0) failed, but the onboard one (2308, 02:00.0) is fine.
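
The first number in a syslog address such as "sd 1:0:1:0" is the SCSI host, and on Linux a host can usually be traced back to the PCI address of its controller through sysfs. A minimal sketch, assuming the usual layout where /sys/class/scsi_host/hostN resolves to a path that contains the controller's PCI address; host_to_pci and pci_of_path are hypothetical helper names:

```shell
#!/bin/sh
# Map a SCSI host number (the "1" in "sd 1:0:1:0") to a PCI address.
# Assumption: /sys/class/scsi_host/hostN is a symlink into a path like
#   /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/scsi_host/host1
# Both function names are made up for this example.

pci_of_path() {
    # $1: a resolved sysfs path; prints the last PCI address it contains,
    # which is the device closest to the SCSI host
    printf '%s\n' "$1" |
        grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' |
        tail -n 1
}

host_to_pci() {
    # $1: SCSI host number, e.g. 1
    pci_of_path "$(readlink -f "/sys/class/scsi_host/host$1")"
}
```

For example, `host_to_pci 1` should print something like 0000:01:00.0 (with the PCI domain prefix) if host 1 sits behind the add-on HBA, which can then be compared with the 01:00.0 and 02:00.0 addresses from the error message to see which controller a dropped disk was on.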

 

What other add-on HBAs do you have on hand?

 

 

Edited by Vr2Io

  • Author
24 minutes ago, Vr2Io said:

Since you have more than 8 disks, you need both controllers, so you can't isolate one controller for testing.

I am wondering whether a faulty controller could affect the good one, but I can't confirm this.

During a normal boot, does the LSI BIOS show both controllers in the same list?

The error message says the add-on HBA (2008, 01:00.0) failed, but the onboard one (2308, 02:00.0) is fine.

 

 

yes, there are two discrete controllers.  The 9211 and 2308.  I'm flashing them with updated firmware now.

2 minutes ago, prongATO said:

yes, there are two discrete controllers.  The 9211 and 2308.  I'm flashing them with updated firmware now.

It would be best to crossflash the add-on 9211 from IR firmware (2008 IR) to IT firmware (2008 IT).

Edited by Vr2Io

  • Author
3 minutes ago, Vr2Io said:

It would be best to crossflash the add-on 9211 from IR firmware (2008 IR) to IT firmware (2008 IT).

They're both in IT mode, both are flashed to the most stable firmware version, 20.00.07.00, and the BIOS has been updated.

image.png.156813de3c94143393b58a0f44d5a5d2.png

Edited by prongATO

As no problem was found after booting into Unraid, you can ignore the BIOS error and start a parity check.

If you reboot again, does the BIOS always show the same error?

What other add-on HBAs do you have on hand?

27 minutes ago, prongATO said:

They're both in IT mode, both are flashed to the most stable firmware version, 20.00.07.00, and the BIOS has been updated.

 

I don't currently have a SAS2008 in my system, but yours shows up as a RAID bus controller, which looks like IR firmware. Anyway, this doesn't matter.

Below is what my previous SAS2008 HBA looked like.

 

2008.thumb.PNG.0aee78341307284377821afbef32c0a0.PNG

Edited by Vr2Io

  • Author
6 minutes ago, Vr2Io said:

I don't currently have a SAS2008 in my system, but yours shows up as a RAID bus controller, which looks like IR firmware. Anyway, this doesn't matter.

Below is what my previous SAS2008 HBA looked like.

 

2008.thumb.PNG.0aee78341307284377821afbef32c0a0.PNG

image.thumb.png.b1da3ab6731ed40a28284cb400fd4b08.png

  • Author

Fingers crossed, I started the parity check last night and it's 75% done with no errors. Surely this all didn't happen because of a dirty 5 in 3 cage?! Yeah, it was a bit dusty and I took it COMPLETELY apart to clean it (complete tear down, clean and reassemble). Maybe the firmware flashes and BIOS update for the HBAs helped also.

 

Diagnostics attached for reference

mrpunrfsx1-diagnostics-20240903-1457.zip

Edited by prongATO

  • Author

I'm VERY encouraged by today's results; parity check finished in just over 13 hours with no errors.  I removed the 5 in 3 cage that was connected to the most errors, completely cleaned the whole device out, checked all the solder joints, replaced the fan with a Noctua fan that can now run 24/7 to keep the drives cooler and reassembled the unit.  

 

Again, I'm not sure if it was some dust that got into the SATA ports on the backplane of the 5 in 3, flashing both HBAs (onboard and PCIe) with the 20.00.07.00 firmware and updating the BIOS, or the work I did on the cages (I took cage 4 apart and cleaned it earlier too), but everything seems to be running fine now. No more random disk errors in the bottom two 5 in 3 cages. I'm going to do the fan modification to the other cages in the coming weeks, as well as cleaning all of them thoroughly.

 

I'll give it a week or two and if the system continues to run well, I'll close this as solved and reference cleaning the cages and updating the firmware on the cards as the solution. Thank you to everyone that's helped me with this process.

 

Latest diagnostics attached for reference.

mrpunrfsx1-diagnostics-20240903-2114.zip

  • Author
On 9/2/2024 at 7:18 PM, Vr2Io said:

Just noticed you have an onboard HBA (2308) and an add-on HBA (2008).

We need to check whether disks drop on only one HBA or across both.

Okay, taking your advice here. After the latest errors on disk 12, I moved its data connection from the HBA card to a SATA port on the motherboard. Performing the rebuild now.

None of this makes any sense. For two days it was hooked up to the HBA card and successfully did a full parity check with no problems. Then, apparently while the mover script was running, it had errors early this morning.

That's why I said you need to find a way to quickly trigger a disk drop. Otherwise you just fall into an endless rebuild loop.

Edited by Vr2Io

  • Author
6 hours ago, Vr2Io said:

That's why I said you need to find a way to quickly trigger a disk drop. Otherwise you just fall into an endless rebuild loop.

I can't figure out what triggers the errors. I honestly thought that it might have been the backplane of the HDD cage being dirty. The server hadn't made it completely through a parity check in months, until the other day. When this problem started, a drive would report errors and I'd rebuild it and it'd be fine for a few months. Then it got to a few weeks, now it's a few days. I went through my normal hardware troubleshooting and replaced the breakout cables, then the HBA card, then the processor, then the power supply and cables. Really the only thing left is the motherboard. I have a bid on a used Supermicro X10SRL-F and will see in a few days if I can get it.

 

 

Edited by prongATO

What about the results of the tests I suggested earlier? There has been no feedback.

Currently we can't confirm whether the problem comes from hardware; it could be software related, so we need a different test or setup to verify.

Edited by Vr2Io
