Hard Drive Errors Help, it's driving me nuts

September 24, 2018

It's back again... argh, I don't get this problem at all.

So about 3 months ago I started getting read errors, and Jonny and a few others helped me identify the issue.

I believed it was my LSI card because all the drives having the issue were plugged into it.

When I replaced it my problems went away, or so I thought.

Now it's back again 2 months later, with all the affected drives on the LSI card again. It's a new card, and LSI told me my last card had no issues.

It's happened 5 times since then. I've calculated that my RM850 can power this many drives.

They are spread evenly over the power connectors.

I've checked my SATA power cables and SATA Data cables.

I've replaced my SATA Data cables.

I changed the profile for my hard drives to always be spinning and not spin down, so that I don't get spin-up current spikes.

They will go into their own low power mode when not requesting data.

I've run memtest on the system for 9 hours with no issues, and I've now even removed one stick of memory that isn't matched with my other sticks.

There is no overclocking set on the system at all.

My guess right now is that either my motherboard or my power supply is dying.

What I don't understand, though, is that it was stable for 2 months, even across reboots.

I installed 6.6.0 and within 10 minutes the read errors were back. I rolled back to 6.5.3 and it was OK for about 3 days, then read errors again.

I can do a full parity rebuild with no issues. It will run for weeks, or it will run for only an hour after a reboot.

I'd imagine that if it was the power supply the problem would happen immediately? It shouldn't go away for 2 months until I reboot it?

I feel it might be motherboard related, maybe the I/O controller, since I do have every slot in use.

I assume my LSI card does its own I/O without using the onboard controller at all.

Could it be the motherboard, or some setting that can cause some kind of PCIe problem?

Any suggestions for things to turn off in the BIOS that could cause PCIe problems?

September 24, 2018

Author

Probably the best way of describing it: it either errors within the first hour of booting up, or it's stable until the next reboot, be that 2-3 months away.

I don't know what could be causing it. When it happens you just reboot and it might be stable.

Even the ambient temperature stays about the same at 21-24°C; it's not too hot or cold here.

September 24, 2018

Author

vault-diagnostics-20180924-1418.zip

September 24, 2018

Intermittent problems are really hard to troubleshoot.

Some questions:

Are the 7 disks with errors all on the same controller?

Do those disks also show SMART errors, or does only the parity disk that dropped out show a SMART change?

I'm wondering why those 7 disks have no temperature reading but still show a green ball.

September 24, 2018

Author

This time the 7 disks are on the LSI card; the 8th disk on the LSI card was fine.

The reboot 2 days ago affected Disk 1 and Disk 10 only. Disk 1 is on the LSI card and Disk 10 is on the onboard controller.

The errors only ever reach the system as read errors; they never show up in SMART.

After a reboot it's like it never happened.

The *'s were there because it had just booted when I took the screenshot; it takes a few minutes for Unraid to pick up the temps.

They are there now, as attached.

In 20 years of working with hardware I've never seen this type of issue before.

I also have this plugged into an APC Smart UPS, so the power is always filtered.

I'm now doing a parity rebuild without errors, and it will be fine until I reboot it again. If I left it for 4 months I know I wouldn't have any errors.

Each reboot is a roll of the dice as to whether it will be error free. Just wait a few hours, and if there are no errors it's fine for however long until I reboot it.

It's truly bizarre. When it happens I sometimes have to reboot it 4-5 times in maintenance mode just to make sure I have no read errors.

I've also tried the whole routine of shutting down the system, pulling out the power, pressing the power button to drain the system of power, and turning it back on.

It can still do it even if it's been fully powered down.

September 24, 2018

Author

Here we go again... throwing errors 1 hour and 30 minutes after being fine.

vault-diagnostics-20180924-1454.zip

September 24, 2018

Author

Disk 1 is on the onboard controller and it's erroring along with the LSI ones. Weirdness at its best.

Diags of it are attached.

September 24, 2018

31 minutes ago, Maticks said:

I'm now doing a parity rebuild without errors, and it will be fine until I reboot it again. If I left it for 4 months I know I wouldn't have any errors.

Each reboot is a roll of the dice as to whether it will be error free. Just wait a few hours, and if there are no errors it's fine for however long until I reboot it.

Interesting, it sounds like the problem may occur after each reboot.

I suggest trying something simple first: install only 2 memory sticks and test. (If XMP is enabled, disable it.)

As you mention, the problem may appear after a reboot, so it may be related to some automatic configuration that happens at each boot.

Edited September 24, 2018 by Benson

September 24, 2018

Author

I just tried factory resetting the BIOS.

Maybe something was turned on that was causing an issue.

I also disabled XMP.

Let's see how we go this time around...

September 24, 2018

Author

200GB into the rebuild after 38 minutes and still OK.

Let's see in an hour how it is. Thanks @Benson

September 24, 2018

Thanks. It seems this needs much more time to verify; waiting for your good news.

Edited September 24, 2018 by Benson

September 24, 2018

Author

Just crossed the two hour mark, though it's hard to tell whether changing the settings made it stable or it was just OK this reboot.

Will see how it goes, but if anyone else on this forum knows what is causing this let me know.

September 24, 2018

Author

@Benson I spoke too soon.

This time it was the two parity drives and one disk on the LSI card as well.

Still can't work out what's causing it.

Just took out another stick of memory so I have 2 in there now.

If it happens again I'll swap out one stick, give it another go, and then change it around after that.

It should at least rule that out. This feels like memory to me given it's pretty random.

Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221102000000)
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing : enclosure logical id(0x500605b003cc0de0), slot(1)
Sep 24 20:40:12 Vault kernel: sd 1:0:4:0: [sdf] Synchronizing SCSI cache
Sep 24 20:40:12 Vault kernel: sd 1:0:4:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing handle(0x000c), sas_addr(0x4433221104000000)
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing : enclosure logical id(0x500605b003cc0de0), slot(7)
Sep 24 20:40:12 Vault kernel: sd 1:0:5:0: [sdg] Synchronizing SCSI cache
Sep 24 20:40:12 Vault kernel: sd 1:0:5:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing handle(0x000e), sas_addr(0x4433221105000000)
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing : enclosure logical id(0x500605b003cc0de0), slot(6)
Sep 24 20:40:12 Vault kernel: md: disk0 write error, sector=1586782472
Sep 24 20:40:12 Vault kernel: md: disk29 write error, sector=1586782472
Sep 24 20:40:12 Vault kernel: sd 1:0:6:0: [sdh] Synchronizing SCSI cache
Sep 24 20:40:12 Vault kernel: sd 1:0:6:0: [sdh] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP received, forcing refresh of disks info.
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing : enclosure logical id(0x500605b003cc0de0), slot(5)
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: sending diag reset !!
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP received, forcing refresh of disks info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault rc.diskinfo[10970]: SIGHUP ignored - already refreshing disk info.
Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: diag reset: FAILED
Sep 24 20:40:13 Vault kernel: iommu: Removing device 0000:02:00.0 from group 1
Sep 24 20:40:35 Vault kernel: md: disk1 read error, sector=2930292224
Sep 24 20:40:35 Vault kernel: md: disk1 read error, sector=2930292232
Sep 24 20:40:35 Vault kernel: XFS (md1): metadata I/O error: block 0xaea8b607 ("xlog_iodone") error 5 numblks 64
Sep 24 20:40:35 Vault kernel: XFS (md1): xfs_do_force_shutdown(0x2) called from line 1232 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa02cd5de
Sep 24 20:40:35 Vault kernel: XFS (md1): Log I/O Error Detected. Shutting down filesystem
Sep 24 20:40:35 Vault kernel: XFS (md1): Please umount the filesystem and rectify the problem(s)
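
For anyone reading a long syslog like the excerpt above, a small script can make the order of events easier to see: which HBA handles were removed, whether the controller reset failed, and which array disks then logged read or write errors. The following is only a rough sketch in Python; the regular expressions are assumptions based purely on the message shapes visible in this excerpt, and `summarize` is a hypothetical helper name, not part of Unraid.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log lines quoted in this thread.
MD_ERROR = re.compile(r"md: (disk\d+) (read|write) error, sector=(\d+)")
HBA_REMOVE = re.compile(r"(mpt\dsas_cm\d+): removing handle\((0x[0-9a-f]+)\)")
HBA_RESET = re.compile(r"(mpt\dsas_cm\d+): (sending diag reset|diag reset: FAILED)")


def summarize(lines):
    """Count md read/write errors per disk and collect HBA removal/reset events."""
    md_errors = Counter()
    removed_handles = []
    resets = []
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            md_errors[(match.group(1), match.group(2))] += 1
            continue
        match = HBA_REMOVE.search(line)
        if match:
            removed_handles.append(match.group(2))
            continue
        match = HBA_RESET.search(line)
        if match:
            resets.append(match.group(2))
    return md_errors, removed_handles, resets


# A few lines copied from the excerpt above, used as sample input.
sample = [
    "Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221102000000)",
    "Sep 24 20:40:12 Vault kernel: md: disk0 write error, sector=1586782472",
    "Sep 24 20:40:12 Vault kernel: mpt2sas_cm0: diag reset: FAILED",
    "Sep 24 20:40:35 Vault kernel: md: disk1 read error, sector=2930292224",
]

errors, handles, resets = summarize(sample)
for (disk, kind), count in sorted(errors.items()):
    print(f"{disk}: {count} {kind} error(s)")
print("removed handles:", handles)
print("reset events:", resets)
```

In the excerpt the handle removals and the failed diag reset come before the md read errors, which would be consistent with the whole controller dropping out rather than individual disks failing, though a summary like this cannot prove the cause.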

Edited September 24, 2018 by Maticks

September 24, 2018

Community Expert

These types of issues are very hard to diagnose. Basically you need to start swapping hardware. If the HBA was already replaced, my next step would be to try a different PSU; although the one you have has plenty of power, it can still have a fault.

September 24, 2018

Author

4 minutes ago, johnnie.black said:

These types of issues are very hard to diagnose. Basically you need to start swapping hardware. If the HBA was already replaced, my next step would be to try a different PSU; although the one you have has plenty of power, it can still have a fault.

I'm trying switching memory around at the moment.

I've removed two sticks so I have only two in there now.

I'll switch out one stick at a time until I have tried two different sticks.

If it happens again I'll switch out the power supply. I have an RM850i in my desktop and an RM850 in my server, so I can swap them and see if that fixes it.

Cabling is exactly the same from what I can see.

Apart from that, my motherboard is a Z87 EVGA about 5 or 6 years old, so maybe it's on the way out.

It's hard to tell, nothing is standing out as faulty. I hate these types of issues. I like easy ones: smoke, clicking or just dead.

Gremlins in my server.

Also the HBA is running P20, which is the latest firmware, so that's OK as well.

Edited September 24, 2018 by Maticks

September 24, 2018

Author

The RM has a 14 + 10 pin layout for the 24-pin cable, while the RMi for some reason has an 18 + 10 layout.

Using the wrong cable would end with a very bad day, and I'm sure the motherboard would then need replacing.

Best I switch those 24-pin cables along with the right power supply. What a stupid thing to do.

People made custom RM cables for their RMi power supplies and discovered they did more than just look pretty.

What has me stumped is that when this happened 3 months ago I did my disk rebuilds and rebooted it a few times with no issues, and also slammed it with reads and writes.

I even built the File Integrity hashes, which slammed the whole array, and it was not a problem.

If it was the PSU I'd expect it to be flaky here and there, not go 70 days with no issues. It's so odd, but clearly some piece of hardware is not happy.

Edited September 24, 2018 by Maticks

September 24, 2018

Do you have active airflow across the HBA heatsink?

September 25, 2018

Author

I've tried switching out all my memory and am now changing from slots 1 and 3 to 2 and 4.

@jonathanm there isn't a lot of airflow in that section of the case, but it hasn't presented an issue before.

I just installed a 120mm fan that pushes cold air all over the HBA. Let's see if that helps, thanks for pointing that out.

I have touched the HBA heatsink and it doesn't feel that hot, but who knows, maybe it hits a high temp at some point and throws errors.

On the next failure I'll be changing over the power supply to see if that helps.

September 25, 2018

Author

So I pointed a fan at the HBA heatsink and it still crashed 30 minutes later.

I removed the PSU from my desktop system, which is an RM850i, and replaced my server's RM850 power supply with it.

So far it's completely stable, but it does do this sometimes. I am doing a parity rebuild and will see if it breaks down or finishes in 11 hours.

If it finishes I'll reboot it in the morning, leave it for an hour or two, then give it a few more reboots to see if I get the read errors again.

It's looking like a power supply replacement is in order; $200 I can deal with. I'm considering the HX750i given I was planning to add 4 more drives in the future.

I use around 250 watts at peak load.

Any suggestions for a brand or model of power supply that supports more drives or gives more power on the 5V rail? I think that's what is needed here.

I've always had a good experience with Corsair, so I've always picked them.

Hard drives and CPU are all I use this system for, no graphics cards.

September 25, 2018

FWIW I had a similar problem recently. Tried some hardware configuration changes (didn't have any spare parts), but they didn't help.

Most recently (8 days ago) I installed a new PSU, and I'm still keeping my fingers crossed.

I think that a month without errors will allow a good certainty that the problem is solved.

September 25, 2018

On 9/24/2018 at 7:20 PM, Maticks said:

If it was the PSU I'd expect it to be flaky here and there, not go 70 days with no issues. It's so odd, but clearly some piece of hardware is not happy.

Agreed, that's why I wouldn't target the PSU. (I still don't think it is the source of the problem.)

10 hours ago, Maticks said:

I have touched the HBA heatsink and it doesn't feel that hot, but who knows, maybe it hits a high temp at some point and throws errors.

All my LSI HBAs run hot, ~70°C.

Or you may refer to the case below too (disable disk spindown).

Edited September 25, 2018 by Benson

September 26, 2018

Author

I already disabled spin down on all my disks about 3 months ago.

I suspected that the power spikes of disks spinning up might cause issues given how many drives I have.

So that made no difference.

The memory switching didn't make any difference, nor did picking different memory slots.

What did make a difference was switching over my PSU and the power cables; all the problems have gone away.

@johnnie.black has been saying, I think for the last 3 times this has happened, that it's my power supply, but I simply didn't believe him.

But after changing it out, it's just gone away.

I've pointed a 120mm fan at the HBA heatsink, as there is little to no airflow there. I'll be keeping that change.

My desktop power supply, an RM850i, is in my server, and the RM850 from the server is in my desktop for now.

I've rebuilt both parity drives and rebooted the system about 6 times without any issues.

I am now running File Integrity on the drives, Disk 1 and Disk 10 in particular since they were hit the worst.

So far 6 files are corrupted, so I will manually restore those.
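
The general idea behind this kind of integrity check is to keep a hash for every file and later recompute it, flagging any file whose hash has changed. Below is a minimal Python sketch of that idea only; it is not how the File Integrity plugin is implemented, and `sha256_of`, `find_corrupted` and the `known_hashes` mapping are hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(root, known_hashes):
    """Return relative paths that are missing or whose hash no longer matches.

    known_hashes maps a path relative to root onto the hex digest recorded
    while the file was known to be good (an assumed storage format).
    """
    root = Path(root)
    bad = []
    for rel_path, expected in known_hashes.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

A check like this only says that a file changed since its hash was recorded; deciding which copy to restore from backup is still a manual step.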

But looking good so far.

I've ordered a new Corsair HX750i; I'd rather spend a bit more and get that 10 year warranty.

I have contacted Corsair support about my RM850 from 2014 and to my surprise they've shipped me a replacement and charged me for it.

But when they get my RM850 back they will refund the cost of the RM850 replacement.

Once the HX750i arrives tomorrow I'll put that in my server.

Then I'll move the RM850i back into my desktop machine.

I can pass the USB header that the power supply's Corsair Link cable is plugged into through to a Windows VM.

Then I can read the power supply data directly in the VM and check how things look going forward.

My RM850 will go to a mate of mine who is building a new system; that will save him $200 and it has a new 5 year warranty.

I am somewhat hopeful it is over, but a little voice in the back of my head still says you never saw something break and then be fixed.

I honestly think it was my 13 disks spinning down: when Plex did its nightly scan they'd all spin up at once, every night for 2 years.

It would have been the same surge of disks starting several times a day; it could have overwhelmed the 5V rail or something odd.
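
To put a rough number on that kind of surge, the per-drive start-up current can be multiplied out across all drives. The Python sketch below does only that arithmetic. The 13 drives come from this post; the per-drive currents are placeholder assumptions, not measurements or datasheet values for these disks, so substitute the figures from your own drive datasheets. Note that 3.5" drives typically draw most of their spin-up current from the 12V rail, with the 5V rail feeding the electronics.

```python
def spinup_load(drives, amps_12v_each, amps_5v_each):
    """Return (12 V amps, 5 V amps, total watts) if every drive starts at once."""
    amps_12v = drives * amps_12v_each
    amps_5v = drives * amps_5v_each
    watts = amps_12v * 12 + amps_5v * 5
    return amps_12v, amps_5v, watts


# 13 drives is from the post; 2.0 A and 0.7 A per drive are placeholder guesses.
amps_12v, amps_5v, watts = spinup_load(13, amps_12v_each=2.0, amps_5v_each=0.7)
print(f"12 V: {amps_12v:.1f} A, 5 V: {amps_5v:.1f} A, total: {watts:.0f} W")
```

With those placeholder figures a simultaneous start comes to about 26 A on 12V and about 9 A on 5V, far above the steady-state draw of spinning disks. Whether that is enough to wear out a given power supply over time is not something this arithmetic can show.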

The RM850 is working fine in my desktop, but I have no SATA peripherals plugged in; the SSDs are M.2 and the graphics card is on PCIe cables.

I won't risk running something that's dying in a good machine, though. It will blow up at some point, I am sure.

I might also add that pointing a fan at the HBA has seen my disk speeds shoot up by a good 60MB/sec. Maybe there was some thermal throttling there as well.

Edited September 26, 2018 by Maticks

September 26, 2018

Note.

Congrats, you got stable again!!

September 29, 2018

Author

I upgraded to 6.6.1 and the read errors came back, but it was a bit different this time.

It never knocked out the parity disks, it only lasted a little while, and it was only 4 disks, not like last time.

Truly weird. I've rolled back to 6.5.3 and attached the diag file. Maybe something in my system is too old for 6.6.1.

vault-diagnostics-20180929-0959.zip

September 29, 2018

Author

That's what started it off.

Sep 29 01:50:39 Vault kernel: mpt2sas_cm0: SAS host is non-operational !!!!
### [PREVIOUS LINE REPEATED 5 TIMES] ###
Sep 29 01:50:44 Vault kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] Synchronizing SCSI cache
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 0
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e 99 fe d0 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1587150544
Sep 29 01:50:44 Vault kernel: sd 7:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554072
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554384
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554504
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554560
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554624
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554640
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554744
Sep 29 01:50:44 Vault kernel: print_req_error: I/O error, dev sdb, sector 1588554808
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e 99 fe d0 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 18 a0 00 00 00 10 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 19 10 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 19 40 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 19 78 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 1b 70 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Sep 29 01:50:44 Vault kernel: sd 7:0:4:0: [sdf] tag#0 CDB: opcode=0x88 88 00 00 00 00 00 5e b0 1b 88 00 00 00 08 00 00
Sep 29 01:50:44 Vault kernel: mpt2sas_cm0: removing handle(0x000c), sas_addr(0x4433221103000000)
Sep 29 01:50:44 Vault kernel: mpt2sas_cm0: enclosure logical id(0x500605b003cc0de0), slot(0)
Sep 29 01:50:44 Vault kernel: sd 7:0:1:0: [sdc] Synchronizing SCSI cache
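
The `CDB:` lines in the log above are the raw SCSI commands that were in flight when the controller went non-operational, and they can be decoded by hand. Opcode 0x88 is READ(16) and opcode 0x35 is SYNCHRONIZE CACHE(10). The Python sketch below decodes just those two opcodes using the standard CDB field layout; `decode_cdb` is a hypothetical helper name, and any other opcode is returned undecoded.

```python
def decode_cdb(hex_bytes):
    """Decode command name, starting LBA and block count from a logged SCSI CDB."""
    cdb = bytes.fromhex(hex_bytes.replace(" ", ""))
    opcode = cdb[0]
    if opcode == 0x88:
        # READ(16): 8-byte big-endian LBA in bytes 2-9, 4-byte length in bytes 10-13.
        lba = int.from_bytes(cdb[2:10], "big")
        blocks = int.from_bytes(cdb[10:14], "big")
        return "READ(16)", lba, blocks
    if opcode == 0x35:
        # SYNCHRONIZE CACHE(10): 4-byte LBA in bytes 2-5, 2-byte count in bytes 7-8.
        lba = int.from_bytes(cdb[2:6], "big")
        blocks = int.from_bytes(cdb[7:9], "big")
        return "SYNCHRONIZE CACHE(10)", lba, blocks
    return f"opcode 0x{opcode:02x}", None, None


# CDB copied from the log excerpt above.
print(decode_cdb("88 00 00 00 00 00 5e 99 fe d0 00 00 00 08 00 00"))
```

The first READ(16) in the excerpt decodes to LBA 1587150544 for 8 blocks, which matches the `I/O error, dev sdb, sector 1587150544` line. The same command bytes are logged against both sdb and sdf, so the failures look like ordinary reads that were aborted when the HBA dropped out rather than reads of bad sectors on one disk.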
