unraid v6.12.11 odd errors - LBA w/drive errors

  • Author

Well, that didn’t last long. At about 3:45, drive 16 reported errors again. I have to reboot because this happens quite a bit, but this segfault normally happens during a rebuild rather than on an initial error. Hopefully you can see the “segfault at IP” error; when that happens, it partially locks up the HTTP interface. Nothing I try to do involving the array works when this specific set of errors occurs.
 

I didn’t have time to change the cage today but will tomorrow.  I also find it interesting that it’s disk 16 again with errors. (ZFS disk just for automated backups of my other ZFS files)

 

diagnostics attached. 
 

any ideas or insight?

 

mrpunrfsx1-diagnostics-20240828-0400.zip

Edited by prongATO

  • Community Expert

Those are not diags.

  • Author
18 hours ago, JorgeB said:

Those are not diags.

Yup, I chose the wrong file; I'll fix it. Thanks for the heads-up.

  • Community Expert

Still looks like a power/connection issue to me.

 

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

  • Author
21 hours ago, JorgeB said:

Still looks like a power/connection issue to me.

 

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

I rearranged some drives and changed the power so that each cage gets two Molex connectors from one PS connection. Then, from the last two ports, I ran a SATA power connection to two cages per port.

 

Drive 16 rebuilt no problem.  Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

 

I am going to remove disk 16 soon because it’s only there for ZFS replication of my other ZFS drives. Should I follow this guide? https://docs.unraid.net/legacy/FAQ/shrink-array/

 

I think diagnostics look good. I am going to let the server run for a few days before I try to shrink the array. I don’t want any errors to pop up while that’s happening. 

mrpunrfsx1-diagnostics-20240830-0115.zip

Edited by prongATO

  • Community Expert
1 hour ago, prongATO said:

Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

It's possible.

 

Diags look good so far.

  • Author
5 hours ago, JorgeB said:

It's possible.

 

Diags look good so far.

IMO, it is the most likely explanation. When I took disk 15 out of the 5 in 3 and hooked it directly to a standalone SATA power connection on the new 850W PS, it rebuilt with no issues, when I couldn’t get it to rebuild before. None of the issues started until I began adding drives to the 4th cage in my system. So it stands to reason that the more drives I added, the more power-starved they became, which, it looks like, can cause Unraid to report errors.

10 hours ago, prongATO said:

Drive 16 rebuilt no problem.  Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

Insufficient power is not likely. My 16-disk build used an 850 W PSU; it peaked under 600 W and averaged ~350 W. But you need to understand that the 5 V rail is derived from the 12 V rail and usually provides only about 100 W combined, no matter whether the PSU is 850 W or 1500 W.

The reason each cage needs two different cables is to minimise the resistance and avoid too much voltage drop. To put it simply: if you pass 1 A through a wire with 1 ohm of resistance, the voltage drop is 1 V. On the 12 V rail that leaves 11 V of 12 V, a drop of 8.3%, but on the 5 V rail it leaves 4 V of 5 V, a drop of 20%.
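The arithmetic above can be checked with a small shell helper. This is only a sketch: the function name `voltage_drop_pct` is made up for illustration, and the 1 A and 1 ohm figures are the simplified numbers from the post, not measurements of any real cable.

```shell
# Percent voltage drop across a wire: drop = I * R, percent = drop / rail * 100.
# voltage_drop_pct is a hypothetical helper, not part of Unraid.
voltage_drop_pct() {
    # $1 = rail voltage (V), $2 = current (A), $3 = wire resistance (ohm)
    awk -v v="$1" -v i="$2" -v r="$3" 'BEGIN { printf "%.1f\n", (i * r) / v * 100 }'
}

voltage_drop_pct 12 1 1   # 12 V rail, prints 8.3
voltage_drop_pct 5 1 1    # 5 V rail, prints 20.0
```

The same 1 V loss costs the 5 V rail a much larger share of its voltage, which is why cable resistance matters more there.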

 

So connecting the disks through a cage adds resistance, and you need to minimise it.

If the cause were insufficient power, you would still be short of power no matter how you connected those disks.

Edited by Vr2Io

  • Author
On 8/29/2024 at 3:43 AM, JorgeB said:

Still looks like a power/connection issue to me.

 

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

I am 100% sure the NIC issue is a cable. I was a dummy and left the patch cables hanging off the connector. They are really nice certified Cat6A patch cables, so I used some zip ties to support the cables instead of letting the weight of the 10' cables hang off the network connectors.

  • Author
6 hours ago, Vr2Io said:

Insufficient power is not likely. My 16-disk build used an 850 W PSU; it peaked under 600 W and averaged ~350 W. But you need to understand that the 5 V rail is derived from the 12 V rail and usually provides only about 100 W combined, no matter whether the PSU is 850 W or 1500 W.

The reason each cage needs two different cables is to minimise the resistance and avoid too much voltage drop. To put it simply: if you pass 1 A through a wire with 1 ohm of resistance, the voltage drop is 1 V. On the 12 V rail that leaves 11 V of 12 V, a drop of 8.3%, but on the 5 V rail it leaves 4 V of 5 V, a drop of 20%.

So connecting the disks through a cage adds resistance, and you need to minimise it.

If the cause were insufficient power, you would still be short of power no matter how you connected those disks.

I just finished reconfiguring everything inside. Since I only have 6 ports on the power supply for SATA/Molex cables, I used 4 of the ports to run dual Molex connectors to each cage. I moved the SATA SSDs into cage 4 instead of having them inside the case on a bracket. I used the remaining two SATA/Molex power ports on the power supply to run SATA power: one port feeds two cages and the other feeds the other two. Since I know I'm going to start reducing the number of drives in my system to avoid future power considerations, I don't feel it's inefficient to use some of the trays for SSDs now. (See the attached pic of the new layout; the only drive missing is the NVMe that's on a PCIe add-on card.) I am going to start getting 12TB drives, starting with the parity drive, of course. Then I'll start moving data off some of the old 3TB and 4TB drives and reallocating the 6TB drives to replace the others. I want to stay below 15 total HDDs in the system from now on.

 

I wanted to get everything squared-away before midnight tonight, when my monthly parity check runs.  I'm considering shrinking the array (removing disk 16 [3TB] because it's only there for ZFS replication and snapshots)

updated layout.png

IMG_4842.JPG

Edited by prongATO
added photo of server for reference

  • Author

I went ahead and started a parity check today since I realized the 1st will be in another day.  I want to let the system hammer the drives for a day or so to make sure everything is copacetic with power/error issues now.  If the parity check completes and I successfully downsize the array, removing disk 16, I will mark this as solved.  @Vr2Io and @JorgeB, thank you both so much for your help with all of this.  I know it was a bunch of trial and error but it was really nice to have two different perspectives while attempting to solve the issues.  

  • Author

well, $hit..   Diagnostics attached.   Happened during a parity check.  Now it's disk 12

 

Is it just a coincidence that the number of errors is always divisible by 1024?

 

 

mrpunrfsx1-diagnostics-20240831-0114.zip

Edited by prongATO

  • Author

Tried rebuilding disk 12, I am back where I started.  I’m about to just give up and throw this thing out the window.

 

I am just going to let it run with disk 12 disabled so I can sleep. Every time I try to rebuild the drive and get an error, the whole HTTP interface locks up. So this is now the third different disk that has supposedly had errors since I started this thread (15, 16 and now 12), all different sizes and all with clean SMART reports.
 

mrpunrfsx1-diagnostics-20240831-0155.zip

Edited by prongATO

  • Author

Not sure if this makes any difference but I didn’t try to rebuild disk 12, the dashboard shows 0 errors, yet the http locked up anyway. 

IMG_4883.jpeg

mrpunrfsx1-diagnostics-20240831-0250.zip

  • Community Expert
3 hours ago, prongATO said:

Happened during a parity check.  Now it's disk 12

Disk dropped offline, again looks like a power/connection issue.

I suggest you stop rebuilding any disk that dropped out due to a read error or crash if the disk itself actually has no problem.

When a disk drops out, can it be mounted by UD (Unassigned Devices)? If so, there is no point in rewriting the whole disk.

The problem has been going on for over a week and the current troubleshooting method isn't working. You need to be proactive, or find a way to reproduce the problem quickly, so the work becomes more effective.

Could you run the built-in memtest, the one in the Unraid boot menu? (I know you have already tested with the memtest plugin.)

Then I will request some stress tests later.

Edited by Vr2Io

  • Author
6 hours ago, Vr2Io said:

I suggest you stop rebuilding any disk that dropped out due to a read error or crash if the disk itself actually has no problem.

When a disk drops out, can it be mounted by UD (Unassigned Devices)? If so, there is no point in rewriting the whole disk.

The problem has been going on for over a week and the current troubleshooting method isn't working. You need to be proactive, or find a way to reproduce the problem quickly, so the work becomes more effective.

Could you run the built-in memtest, the one in the Unraid boot menu? (I know you have already tested with the memtest plugin.)

Then I will request some stress tests later.


And yes, as soon as the disk drops out of the array from errors, it’s immediately picked up by unassigned devices. 

  • Author

memtest.jpg

  • Author
And to recap, before asking for assistance I've already replaced:

HBA card

Power Supply and PS cables

Processor

HBA card breakout cables

different drives (which we see now it is not drive specific)

 

Other than it being the motherboard (God, I hope not), I'm not sure what else it could be.  I am going to revisit my power cabling and possibly switch to doing a discrete SATA cable (one cable with two connectors) to each 5 in 3 and double Molex (1 discrete cable for two cages)

 

Just now, prongATO said:

And to recap, before asking for assistance I've already replaced:

HBA card

Power Supply and PS cables

Processor

HBA card breakout cables

different drives (which we see now it is not drive specific)

 

Other than it being the motherboard (God, I hope not), I'm not sure what else it could be.  I am going to revisit my power cabling and possibly switch to doing a discrete SATA cable (one cable with two connectors) to each 5 in 3 and double Molex (1 discrete cable for two cages)

 

 

Note: I'm rechecking the syslogs from the different diagnostics before requesting some different tests.

  • Author
2 minutes ago, Vr2Io said:

 

Note: I'm rechecking the syslogs from the different diagnostics before requesting some different tests.

Roger that, I'll make sure to supply a diagnostics after every change from this point forward.  How many passes until we're satisfied with the memtest?

 

It's through 3 passes and 54% on the 4th, all good.

Edited by prongATO

5 minutes ago, prongATO said:

It's through 3 passes and 54% on the 4th, all good.

Noted, and thanks. Whenever I build or change hardware, I run memtest first because it is absolutely essential for verifying system stability. From memory, I have never had bad DDR3, but I have had bad DDR4 about 4 times.

Edited by Vr2Io

  • Author
9 minutes ago, Vr2Io said:

Noted, and thanks. Whenever I build or change hardware, I run memtest first because it is absolutely essential for verifying system stability. From memory, I have never had bad DDR3, but I have had bad DDR4 about 4 times.

I'll let it run for 4 passes plus.  Since the server had been down all day, I might as well let it run until we're satisfied that there will be no found errors.  I'll post a screenshot of the test screen before I stop.

Checking the different logs, there is no clear reason for the disk drops; they happen at random times. Once a disk drops from the array, you will get the lines below in the syslog, and the disk reconnects within a few seconds with a different drive letter. This is the normal result of a disk drop.

 

Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
Aug 24 11:42:51 MRPUNRFSX1 kernel: sd 1:0:1:0: device_unblock and setting to running, handle(0x0009)

Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Attached scsi generic sg8 type 0
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Power-on or device reset occurred
Aug 24 11:42:57 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL ()' is not set to auto mount.
-----

Aug 23 13:57:35 MRPUNRFSX1 kernel: sd 3:0:4:0: device_block, handle(0x000d)
Aug 23 13:57:37 MRPUNRFSX1 kernel: sd 3:0:4:0: device_unblock and setting to running, handle(0x000d)

Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Attached scsi generic sg11 type 0
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Power-on or device reset occurred
Aug 23 13:57:43 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (dev1)' is not set to auto mount.
-----

Aug 23 16:57:20 MRPUNRFSX1 kernel: sd 4:0:3:0: device_block, handle(0x000c)
Aug 23 16:57:22 MRPUNRFSX1 kernel: sd 4:0:3:0: device_unblock and setting to running, handle(0x000c)

Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Attached scsi generic sg10 type 0
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Power-on or device reset occurred
Aug 23 16:57:30 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (sdu)' is not set to auto mount.
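To pull just these drop and reconnect lines out of a saved syslog, something like the following could be used. It is a minimal sketch: the function names `list_disk_drops` and `count_disk_drops` are made up for illustration, and the match strings are taken only from the excerpts above, so other controllers or drivers may log different messages.

```shell
# Hypothetical helpers for scanning a syslog file for the disk-drop pattern shown above.

# Print every drop, unblock, and reset line, with timestamps, in log order.
list_disk_drops() {
    # $1 = path to a syslog file (defaults to the live log)
    grep -E 'device_block|device_unblock|Power-on or device reset occurred' "${1:-/var/log/syslog}"
}

# Count only the initial drop events (the "device_block," lines).
count_disk_drops() {
    grep -c 'device_block,' "${1:-/var/log/syslog}"
}

# Example (path is an assumption; point it at the syslog inside an unzipped diagnostics file):
#   list_disk_drops /path/to/diagnostics/logs/syslog.txt
```

Comparing the timestamps of the listed lines across several diagnostics makes it easier to see whether the drops cluster around a particular activity.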

 

 

(1) Please check that a dropped disk shows up in UD and that you can mount it and read the file contents; if so, you don't actually need to rebuild it. Once the problem is fixed, re-add it to the array and rebuild parity.

(2) Disable automatic disk spin-down.

(3) Open a terminal and type "dmesg -wT" to watch for any "I/O error" messages

or

"tail -f /var/log/syslog" to watch for any device_block and device_unblock messages.

You can press "Enter" at any time to insert a blank line, which makes it easy to spot any new log entries after that point.
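If the raw output of step (3) is too noisy, the stream can be filtered down to the messages of interest. This is a sketch only: `filter_disk_events` is a made-up name, and the pattern covers just the three strings mentioned in this thread.

```shell
# Hypothetical filter: keep only the lines that matter for this problem.
# --line-buffered makes grep print each match immediately when reading from a pipe.
filter_disk_events() {
    grep --line-buffered -E 'I/O error|device_block|device_unblock'
}

# Usage (each command keeps running until you press Ctrl+C):
#   tail -F /var/log/syslog | filter_disk_events
#   dmesg -wT | filter_disk_events
```

With the filter in place a quiet terminal means nothing has happened yet, so the "Enter" trick is no longer needed to spot new entries.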

 

(4) A parity operation should generate maximum I/O and stress the disks enough to trigger a drop-out, but that doesn't seem to be the case here. So I suggest a different test that will hopefully reproduce the problem sooner. I don't know what will trigger it, so just try it.

Please open 3 terminals and run the commands below, one per terminal. Two concurrent hashes of big files should put enough load on the CPU; the third hashes small files.

 

find /mnt/disk5 -type f -size +8G -exec b3sum {} \;

Hashes files on disk5 that are larger than 8 GB.

find /mnt/disk6 -type f -size +8G -exec b3sum {} \;

Hashes files on disk6 that are larger than 8 GB.

find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 b3sum

Hashes files on the user share (in random order) that are smaller than 5k.
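As an alternative to three separate terminals, the same three jobs could be started from one shell and left running in the background. This is a rough sketch, not a tested Unraid procedure: the function name `run_hash_load`, the `HASHER`, `BIG`, `SMALL` and `LOGDIR` variables, and the `/tmp/hash-stress` log directory are all assumptions added for illustration. Only the `find`, `sort`, `xargs` and `b3sum` invocations come from the commands above.

```shell
# Hypothetical wrapper: run the three hashing jobs concurrently and log their output.
run_hash_load() {
    # $1, $2 = disks to scan for big files, $3 = share to scan for small files
    local hasher="${HASHER:-b3sum}"             # hash tool from the post
    local big="${BIG:-+8G}"                     # size filter for the big-file jobs
    local small="${SMALL:--5k}"                 # size filter for the small-file job
    local logdir="${LOGDIR:-/tmp/hash-stress}"  # assumed log location

    mkdir -p "$logdir"
    find "$1" -type f -size "$big" -exec "$hasher" {} \; > "$logdir/big-a.log" 2>&1 &
    find "$2" -type f -size "$big" -exec "$hasher" {} \; > "$logdir/big-b.log" 2>&1 &
    find "$3" -type f -size "$small" -print0 | sort -Rz | xargs -0 -n1 "$hasher" > "$logdir/small.log" 2>&1 &
    wait    # return only after all three jobs have finished
}

# Usage, matching the commands above:
#   run_hash_load /mnt/disk5 /mnt/disk6 /mnt/user
```

Keeping a second terminal open with the syslog or dmesg watch from step (3) shows whether a disk drops while the load is running.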

 

image.png.61cf2a39e69f5b9f0a8d8921bb7f3d87.png

 

image.png.07c0d5e190678fba8185f39f9d84c5e1.png

 

 

 

Edited by Vr2Io

  • Author
First, man, your drives get toasty!  The highest I’ve ever seen any drive in my system is 38c.

 

I’ll get started on your suggestions tomorrow.  I decided to let the memtest run overnight, just to be safe. 
