My setup is in a tower case with 5 case/CPU fans, a Supermicro X10SL7-F motherboard, 32GB of Crucial/Micron ECC RAM, a Xeon 1285v3 CPU, and 4 iStarUSA 5-in-3 cages for a possible 20 HDDs (17 slots are filled). I have two SATA SSDs running cache_ssd, and all my dockers and processing run on an NVMe drive installed on an add-on card in a slot. It currently has a 650W Seasonic PSU, but it's 10 years old.

I've been chasing this problem for months now. Drive 15, my second-youngest drive at just over a year old, periodically reports errors, but every SMART test comes back clean. I've replaced the HBA card, its breakout cables, and the CPU, and I even tried a new drive in different slots and got the same errors. I also installed an additional fan blowing directly on the heatsink of the HBA card. I've tried drive 15 in two different 5-in-3 enclosures and in every free slot, with the same results. I ordered a new power supply that will be here tomorrow; that's the last thing I can think of that may be the issue.

Sometimes, after I rebuild the data from parity, the rebuild finishes and the drive is okay for a week or two, then reports errors again. Other times the rebuild fails: the drive reports errors, gets disabled, then shows up under Unassigned Devices. I am currently trying to rebuild the drive again and was waiting to see if it would error out, but I noticed some weird things in the syslog, so I decided to go ahead and make a post.

Could all of this be because of a failing PSU? I can usually solve issues by referencing others' posts and held off on posting this, but I need some help.

mrpunrfsx1-syslog-20240822-1943.zip

To be more certain, you could run one more parity check. If it completes cleanly, you can assume the problem is caused by a plugin, and then work out which one. I would suggest uninstalling disklocation first.

unraid v6.12.11 odd errors - LBA w/drive errors - Page 2 - General Support

August 28, 2024

Author

Well, that didn't last long. At about 3:45, drive 16 reported errors again. I had to reboot. This happens quite a bit, but the segfault normally happens during a rebuild rather than on an initial error. Hopefully you can see the "segfault at IP" error; when that happens, it partially locks up the http interface. Nothing I try to do involving the array works when this specific set of errors occurs.

I didn't have time to change the cage today but will tomorrow. I also find it interesting that it's disk 16 again with errors. (It's a ZFS disk just for automated backups of my other ZFS files.)

Diagnostics attached.

Any ideas or insight?

mrpunrfsx1-diagnostics-20240828-0400.zip

Edited August 29, 2024 by prongATO

Quote

August 28, 2024

Community Expert

Those are not diags.

Quote

August 29, 2024

Author

18 hours ago, JorgeB said:

Those are not diags.

Yup, I chose the wrong file. I'll fix it. Thanks for the heads-up.

Quote

August 29, 2024

Community Expert

Still looks like a power/connection issue to me.

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

Quote

August 30, 2024

Author

21 hours ago, JorgeB said:

Still looks like a power/connection issue to me.

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

I rearranged some drives and changed the power from one PSU connection to two Molex connectors per cage. Then, from the last two ports, I ran a SATA power connection to two cages per port.

Drive 16 rebuilt no problem. Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

I am going to remove disk 16 soon because it's only there for ZFS replication of my other ZFS drives. Should I follow this guide? https://docs.unraid.net/legacy/FAQ/shrink-array/

I think the diagnostics look good. I am going to let the server run for a few days before I try to shrink the array. I don't want any errors to pop up while that's happening.

mrpunrfsx1-diagnostics-20240830-0115.zip

Edited August 30, 2024 by prongATO

Quote

August 30, 2024

Community Expert

1 hour ago, prongATO said:

Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

It's possible.

Diags look good so far.

Quote

August 30, 2024

Author

5 hours ago, JorgeB said:

It's possible.

Diags look good so far.

IMO, it is the most likely explanation. When I took disk 15 out of the 5 in 3 and hooked it directly to a standalone SATA power connection on the new 850W PS, it rebuilt with no issues when I couldn't get it to rebuild before. None of the issues started until I started adding drives to the 4th cage in my system. So it stands to reason that the more drives I added, the more power-starved they became, which, it looks like, can cause unraid to report errors.

Quote

August 30, 2024

10 hours ago, prongATO said:

Drive 16 rebuilt no problem. Do you think it was just that as I kept putting more and more drives in, the power just couldn't keep up?

Insufficient power is not likely. My 16-disk build uses an 850W PSU; it peaks under 600W and averages ~350W. But you need to understand that the 5V rail is derived from the 12V rail and usually provides a combined power of only ~100W, whether the PSU is 850W or 1500W.

The reason each cage needs two separate cables is to minimise resistance and avoid too much voltage drop. Put simply: if 1A passes through a wire of 1 ohm, the voltage drop is 1V. On 12V that leaves 11V of 12V, a drop of 8.3%, but on 5V it leaves 4V of 5V, a drop of 20%.

Connecting to the disks through a cage adds resistance, so you need to minimise it.

If the cause were insufficient power, then no matter how you connected those disks, you would still be short of power.
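
The arithmetic in that example can be checked with a short sketch. The 1A and 1 ohm figures are the illustrative numbers from the post, not measurements from this server, and the function name `drop_percent` is made up for the example. The last line assumes two identical cables sharing the current equally, which halves the effective resistance.

```python
def drop_percent(rail_volts: float, current_amps: float, wire_ohms: float) -> float:
    """Percentage of the rail voltage lost across the wiring (Ohm's law: V = I * R)."""
    drop_volts = current_amps * wire_ohms
    return 100.0 * drop_volts / rail_volts


if __name__ == "__main__":
    # Illustrative figures from the post: 1 A through 1 ohm of wiring.
    for rail in (12.0, 5.0):
        print(f"{rail:g}V rail: {drop_percent(rail, 1.0, 1.0):.1f}% drop")
    # Assumption: two identical cables in parallel halve the resistance.
    print(f"5V rail, two cables: {drop_percent(5.0, 1.0, 0.5):.1f}% drop")
```

The same 1V loss costs a much larger share of the 5V rail than of the 12V rail, which is why wiring resistance matters more for the 5V supply to the drives.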

Edited August 30, 2024 by Vr2Io

Quote

August 30, 2024

Author

On 8/29/2024 at 3:43 AM, JorgeB said:

Still looks like a power/connection issue to me.

There are also NIC issues logged, and emhttp segfaulted, so you'll need to reboot.

I am 100% sure the NIC issue is a cable. I was a dummy and left the patch cables hanging off the connector. They are really nice certified Cat6A patch cables, so I used some zip ties to support them instead of letting the weight of the 10' cables hang off the network connectors.

Quote

August 30, 2024

Author

6 hours ago, Vr2Io said:

Insufficient power is not likely. My 16-disk build uses an 850W PSU; it peaks under 600W and averages ~350W. But you need to understand that the 5V rail is derived from the 12V rail and usually provides a combined power of only ~100W, whether the PSU is 850W or 1500W.

The reason each cage needs two separate cables is to minimise resistance and avoid too much voltage drop. Put simply: if 1A passes through a wire of 1 ohm, the voltage drop is 1V. On 12V that leaves 11V of 12V, a drop of 8.3%, but on 5V it leaves 4V of 5V, a drop of 20%.

Connecting to the disks through a cage adds resistance, so you need to minimise it.

If the cause were insufficient power, then no matter how you connected those disks, you would still be short of power.

I just finished reconfiguring everything inside. Since I only have 6 ports on the power supply for SATA/Molex cables, I used 4 of the ports to run dual Molex connectors to each cage. I moved the SATA SSDs into cage 4 instead of having them inside the case on a bracket. I used the remaining two SATA/Molex power ports on the power supply to run SATA power: one to two of the cages, the other to the other two. Since I know I'm going to start reducing the number of drives in my system to avoid future power considerations, I don't feel it's inefficient to use some of the trays for SSDs now. (See the attached pic of the new layout; the only drive missing is the NVMe that's on a PCIe add-on card.) I am going to start getting 12TB drives, starting with the parity drive, of course. Then I'll start moving data off some of the old 3TB and 4TB drives and reallocating the 6TB drives to replace the others. I want to stay below 15 total HDDs in the system from now on.

I wanted to get everything squared away before midnight tonight, when my monthly parity check runs. I'm considering shrinking the array (removing disk 16 [3TB], because it's only there for ZFS replication and snapshots).

Edited August 30, 2024 by prongATO
added photo of server for reference

Quote

August 31, 2024

Author

I went ahead and started a parity check today since I realized the 1st will be in another day. I want to let the system hammer the drives for a day or so to make sure everything is copacetic with power/error issues now. If the parity check completes and I successfully downsize the array, removing disk 16, I will mark this as solved. @Vr2Io and @JorgeB, thank you both so much for your help with all of this. I know it was a bunch of trial and error but it was really nice to have two different perspectives while attempting to solve the issues.

Quote

August 31, 2024

Author

Well, $hit. Diagnostics attached. Happened during a parity check. Now it's disk 12.

Is it just a coincidence that the number of errors is always divisible by 1024?

mrpunrfsx1-diagnostics-20240831-0114.zip

Edited August 31, 2024 by prongATO

Quote

August 31, 2024

Author

Tried rebuilding disk 12, and I am back where I started. I'm about to just give up and throw this thing out the window.

I am just going to let it run with disk 12 disabled so I can sleep. Every time I try to rebuild the drive and get an error, the whole http interface locks up. So this is now the third different disk that has supposedly had errors since I started this thread (15, 16 and now 12), all different sizes and all with clean SMART reports.

mrpunrfsx1-diagnostics-20240831-0155.zip

Edited August 31, 2024 by prongATO

Quote

August 31, 2024

Author

Not sure if this makes any difference but I didn’t try to rebuild disk 12, the dashboard shows 0 errors, yet the http locked up anyway.

mrpunrfsx1-diagnostics-20240831-0250.zip

Quote

August 31, 2024

Community Expert

3 hours ago, prongATO said:

Happened during a parity check. Now it's disk 12

Disk dropped offline, again looks like a power/connection issue.

Quote

August 31, 2024

I suggest you stop rebuilding disks when a disk has dropped out due to a read error or crash and actually has no problem.

When a disk drops out, can it be mounted by UD? If the answer is yes, there is no point in fully rewriting the disk.

The problem has been going on for over a week and the current troubleshooting method isn't working. You need to be proactive, or find a way to reproduce the problem quickly, to make the work more effective.

Could you test with the built-in memtest, the one in the Unraid boot menu? (I know you have tested with the memtest plugin.)

Then I will request some stress tests later.

Edited August 31, 2024 by Vr2Io

Quote

August 31, 2024

Author

6 hours ago, Vr2Io said:

I suggest you stop rebuilding disks when a disk has dropped out due to a read error or crash and actually has no problem.

When a disk drops out, can it be mounted by UD? If the answer is yes, there is no point in fully rewriting the disk.

The problem has been going on for over a week and the current troubleshooting method isn't working. You need to be proactive, or find a way to reproduce the problem quickly, to make the work more effective.

Could you test with the built-in memtest, the one in the Unraid boot menu? (I know you have tested with the memtest plugin.)

Then I will request some stress tests later.

And yes, as soon as the disk drops out of the array from errors, it’s immediately picked up by unassigned devices.

Quote

September 1, 2024

Author

8 hours ago, Vr2Io said:

I suggest you stop rebuilding disks when a disk has dropped out due to a read error or crash and actually has no problem.

When a disk drops out, can it be mounted by UD? If the answer is yes, there is no point in fully rewriting the disk.

The problem has been going on for over a week and the current troubleshooting method isn't working. You need to be proactive, or find a way to reproduce the problem quickly, to make the work more effective.

Could you test with the built-in memtest, the one in the Unraid boot menu? (I know you have tested with the memtest plugin.)

Then I will request some stress tests later.

And to recap, before asking for assistance I've already replaced:

HBA card

Power supply and PS cables

Processor

HBA card breakout cables

Different drives (we now see it is not drive-specific)

Other than it being the motherboard (God, I hope not), I'm not sure what else it could be. I am going to revisit my power cabling and possibly switch to a discrete SATA cable (one cable with two connectors) to each 5 in 3 and double Molex (1 discrete cable for two cages).

Quote

September 1, 2024

Just now, prongATO said:

And to recap, before asking for assistance I've already replaced:

HBA card

Power supply and PS cables

Processor

HBA card breakout cables

Different drives (we now see it is not drive-specific)

Other than it being the motherboard (God, I hope not), I'm not sure what else it could be. I am going to revisit my power cabling and possibly switch to a discrete SATA cable (one cable with two connectors) to each 5 in 3 and double Molex (1 discrete cable for two cages).

Note: I am rechecking the syslogs from the different diagnostics before requesting any further tests.

Quote

September 1, 2024

Author

2 minutes ago, Vr2Io said:

Note: I am rechecking the syslogs from the different diagnostics before requesting any further tests.

Roger that, I'll make sure to supply diagnostics after every change from this point forward. How many passes until we're satisfied with the memtest?

It's through 3 passes and 54% on the 4th, all good.

Edited September 1, 2024 by prongATO

Quote

September 1, 2024

5 minutes ago, prongATO said:

It's through 3 passes and 54% on the 4th, all good.

Noted, thanks. Whenever I build or change hardware, I run memtest first because it is absolutely essential for verifying system stability. From memory, I have never had bad DDR3, but I have had bad DDR4 about 4 times.

Edited September 1, 2024 by Vr2Io

Quote

September 1, 2024

Author

9 minutes ago, Vr2Io said:

Noted, thanks. Whenever I build or change hardware, I run memtest first because it is absolutely essential for verifying system stability. From memory, I have never had bad DDR3, but I have had bad DDR4 about 4 times.

I'll let it run for 4 passes plus. Since the server had been down all day, I might as well let it run until we're satisfied that there will be no found errors. I'll post a screenshot of the test screen before I stop.

Quote

September 1, 2024

Checking the different logs, there is no clear reason for the disk drops; they happen at random times. Once a disk drops from the array, you get the lines below in the syslog, and the disk reconnects within a few seconds under a different drive letter. This is the normal result of a disk drop.

Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
Aug 24 11:42:51 MRPUNRFSX1 kernel: sd 1:0:1:0: device_unblock and setting to running, handle(0x0009)

Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Attached scsi generic sg8 type 0
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Power-on or device reset occurred
Aug 24 11:42:57 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL ()' is not set to auto mount.
-----

Aug 23 13:57:35 MRPUNRFSX1 kernel: sd 3:0:4:0: device_block, handle(0x000d)
Aug 23 13:57:37 MRPUNRFSX1 kernel: sd 3:0:4:0: device_unblock and setting to running, handle(0x000d)

Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Attached scsi generic sg11 type 0
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Power-on or device reset occurred
Aug 23 13:57:43 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (dev1)' is not set to auto mount.
-----

Aug 23 16:57:20 MRPUNRFSX1 kernel: sd 4:0:3:0: device_block, handle(0x000c)
Aug 23 16:57:22 MRPUNRFSX1 kernel: sd 4:0:3:0: device_unblock and setting to running, handle(0x000c)

Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Attached scsi generic sg10 type 0
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Power-on or device reset occurred
Aug 23 16:57:30 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (sdu)' is not set to auto mount.
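
One way to pull these block/unblock lines out of a saved syslog is sketched below. It assumes the line format shown in the excerpts above; the regular expression and the function name `find_block_events` are illustrative and are not part of Unraid or any plugin.

```python
import re
from typing import Iterable, List, Tuple

# Matches kernel lines like:
#   Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"sd\s+(?P<scsi>[\d:]+):\s+(?P<event>device_block|device_unblock)\b"
)


def find_block_events(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, scsi_address, event) for each block/unblock line."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line)
        if match:
            events.append((match["ts"], match["scsi"], match["event"]))
    return events


if __name__ == "__main__":
    import sys

    # Assumption: the default path is the live syslog; pass another file as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as fh:
        for ts, scsi, event in find_block_events(fh):
            print(f"{ts}  sd {scsi}  {event}")
```

Running it against the syslog from each diagnostics zip would list every drop with its SCSI address, which makes it easier to see whether the drops cluster on one port or cage.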

(1) Please confirm that when the dropped disk falls to UD, you can mount it and read the file contents. If so, you do not actually need to rebuild it. Once the problem is fixed, re-add it to the array and rebuild parity.

(2) Disable automatic disk spin-down.

(3) Open a terminal and run "dmesg -wT" to watch for any "I/O error"

or

"tail -f /var/log/syslog" to watch for any device_block and device_unblock messages.

You can press "Enter" at any time to insert a blank line in the terminal, which makes new log entries after that point easier to spot.

(4) A parity operation should generate maximum I/O and so be a good way to stress disks into dropping out, but in this case it seems not to. So I will suggest a different test in the hope of reproducing the problem ASAP. I don't know what will trigger it, so just try it.

Please open 3 terminals and run the commands below. Two concurrent hashes of big files should put enough load on the CPU; the third just hashes small files.

find /mnt/disk5 -type f -size +8G -exec b3sum {} \;

Hashes files on disk5 that are larger than 8GB.

find /mnt/disk6 -type f -size +8G -exec b3sum {} \;

Hashes files on disk6 that are larger than 8GB.

find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 b3sum

Hashes files on the user shares (in random order) that are smaller than 5k.
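
For anyone without b3sum installed, the idea behind these commands (pick files by size, then read each one end to end while hashing it) can be sketched in Python. This is only an approximation under stated assumptions: it uses BLAKE2b from the standard library as a stand-in for BLAKE3, and the helper names `files_by_size` and `hash_file` are made up for the example.

```python
import hashlib
import os
from typing import Iterator, Optional


def files_by_size(root: str, min_bytes: int = 0,
                  max_bytes: Optional[int] = None) -> Iterator[str]:
    """Yield regular files under root whose size is within [min_bytes, max_bytes]."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes and (max_bytes is None or size <= max_bytes):
                yield path


def hash_file(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its BLAKE2b hex digest."""
    digest = hashlib.blake2b()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Roughly mirrors the first command: files on /mnt/disk5 larger than 8 GiB.
    for path in files_by_size("/mnt/disk5", min_bytes=8 * 1024**3 + 1):
        print(hash_file(path), path)
```

The digest itself is not the point here; reading every byte of large files is what keeps the disk, controller and CPU busy at the same time.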

image.png.61cf2a39e69f5b9f0a8d8921bb7f3d87.png

image.png.07c0d5e190678fba8185f39f9d84c5e1.png

Edited September 1, 2024 by Vr2Io

Quote

September 1, 2024

Author

First, man, your drives get toasty! The highest I’ve ever seen any drive in my system is 38c.

I’ll get started on your suggestions tomorrow. I decided to let the memtest run overnight, just to be safe.

Quote


Solved by Vr2Io
