WoRie Posted June 18, 2023

Hi, I ran the 6.12-rc5 before updating to the final release, and I now encounter errors with my LSI 9300 HBA. If everything is freshly booted and spun up, all is fine. However, when disks spin down, the GUI still shows a green dot for the disks connected to the HBA, while the ones connected to my mainboard SATA controller show the correct grey dot. This wouldn't bother me, but the HBA disks also suddenly report read errors after some time, which are again fixed by rebooting the host, until the disks enter standby. Could this be due to power management of the PCIe devices? Or is my HBA on the way out?
JorgeB Posted June 18, 2023

Please post the diagnostics.
WoRie Posted June 19, 2023 (Author)

There you are. I've since disabled anything ASPM-related in the BIOS, but now again all disks are spun down and won't come up. wonas-diagnostics-20230619-2019.zip
JorgeB Posted June 19, 2023

Jun 19 19:50:26 WoNas kernel: mpt3sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!

You are having HBA issues; make sure it's sufficiently cooled and well seated, or try a different PCIe slot if available.
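As an aside, faults like the one quoted above can be spotted by filtering the kernel log for mpt3sas fault events. A minimal sketch, using sample lines modelled on the output in this thread (on a live Unraid box you would point the grep at /var/log/syslog instead):

```shell
# Write sample syslog lines (stand-ins for the real log on a live system)
cat <<'EOF' > /tmp/sample-syslog
Jun 19 19:50:26 WoNas kernel: mpt3sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jun 19 19:50:27 WoNas kernel: md: disk10 read error, sector=128
Jun 19 19:50:28 WoNas kernel: usb 1-1: new high-speed USB device
EOF

# Pull out only mpt3sas controller fault/reset events
grep -E 'mpt3sas.*(fault|dead_ioc)' /tmp/sample-syslog
```

Only the first sample line matches, which is exactly the "dead IOC" signature that indicates the controller dropped off the bus.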
WoRie Posted June 21, 2023 (Author)

Hi JorgeB, I think I f*cked up... I pulled the HBA and repasted the heat spreader on the chip. The old thermal compound was completely dry and solid, and oozed a solidified liquid that looked like tree sap. After reassembling, I booted Unraid and started the rebuild of disk 10 (as shown in the screenshot). During this, the connection broke down again. I believe the card is toast and dies when it gets too warm (it's been pretty warm here the last few days), even with new thermal compound and a directly attached fan.

The issue now is that the rebuild of disk 10 hasn't finished, and disk 8 also suddenly showed as "disabled - content emulated". And this is where I think I made a mistake... I stopped the array, set the failed disk 8 to "disabled", started the array in maintenance mode, stopped it again, and tried reassigning the disk to slot 8. But now it shows as a new device... I can only start the array when I set disk 8 to unassigned; otherwise too many disks are missing/changed.

I don't want to carry on with the rebuild of disk 10 with this shot HBA; a new one should arrive tomorrow. However, will I be able to fix this situation at all, and what would be the best course of action? Will I be able to correctly reassign disk 8 after disk 10 has been rebuilt, or is the data on disk 8 gone, so that I have to add it as a new device? The partition is still there in Unassigned Devices and my array is only 30% full, so if I can save the files somehow, that would be great. The files are not irreplaceable, but would nevertheless be a hassle to acquire again.
WoRie Posted June 23, 2023 (Author)

There you are. I've just installed a new HBA; it says "Data Rebuild", but the two disks affected by the outage show up as unmountable. wonas-diagnostics-20230623-1637.zip
JorgeB Posted June 23, 2023

Unraid cannot emulate two disks with single parity; what happened to disk8? It's not even assigned.
WoRie Posted June 23, 2023 (Author)

JorgeB said: "Unraid cannot emulate two disks with single parity"

I know, that's why I'm a bit scared.

JorgeB said: "[...] what happened to disk8? It's not even assigned."

It's showing up as new. The data on it is still accessible when mounted through Unassigned Devices, but I cannot reintroduce it into the array like this. And I also can't check the filesystem, because if I assign both, I cannot start the array due to 2 missing/new disks with single parity.
JorgeB Posted June 23, 2023

SMART looks OK. I assume disk8 was the second one to get disabled? If yes, we can force-enable it to try and rebuild disk10, assuming parity is still valid.
WoRie Posted June 23, 2023 (Author)

Yes, disk8 suddenly showed up as dead. How can I force it back into the array? I believe parity should still be valid and the disks should be fine. The issue was the HBA: in my old case it was directly cooled by a nearby case fan; in the new case that fan is missing, and it has been 30°C here the last few days. I believe that was the culprit and the HBA died. I zip-tied a small Noctua fan to the new HBA to be safe in the future.
JorgeB Posted June 23, 2023

This will only work if parity is still valid, but if nothing else it should re-enable disk8 and its data:

- Tools -> New Config -> Retain current configuration: All -> Apply
- Check all assignments and assign any missing disk(s) if needed, including the old disk8 and the current disk10
- IMPORTANT - Check both "Parity is already valid" and "Maintenance mode" and start the array (note that the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as it doesn't account for the checkbox, but nothing will be overwritten as long as it's checked)
- Stop the array
- Unassign disk10
- Start the array (in normal mode now) and post new diags
WoRie Posted June 23, 2023 (Author)

Here are the new diagnostics. How would I go about rebuilding disk10 now, if (hopefully) everything is fine again? wonas-diagnostics-20230623-1834.zip
JorgeB Posted June 23, 2023

I assume disk10 was xfs? If yes, stop the array, click on disk10, change the fs from auto to xfs, and post new diags after array start. Also, this is not very good:

Jun 23 18:35:00 WoNas kernel: md: disk2 read error, sector=128
WoRie Posted June 23, 2023 (Author)

I changed disk10 to xfs and started the array. Disk10 shows an unsupported file system. I don't know about the read errors; when the HBA became unstable I saw reported write errors, which were then cleaned up after a reboot. wonas-diagnostics-20230623-2212.zip
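For context, the step that usually follows an "unsupported file system" report on an emulated disk is a read-only xfs_repair run against the array's md device while the array is started in maintenance mode. A hedged sketch that just composes the command for a given slot (the slot number 10 is taken from this thread; the device node is assumed to be /dev/mdN, though newer Unraid releases may use /dev/mdNp1):

```shell
# Compose the read-only filesystem check command for a given array slot.
# Actually running it requires the array started in maintenance mode.
xfs_check_cmd() {
  # -n = no-modify mode: report problems without changing the filesystem
  echo "xfs_repair -n /dev/md$1"
}

xfs_check_cmd 10
```

Running the printed command reports any filesystem damage; dropping the `-n` would then attempt the actual repair, which is only advisable once the underlying hardware problem is resolved.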
JorgeB Posted June 24, 2023

Jun 23 22:10:47 WoNas kernel: md: disk2 read error, sector=8589934608
Jun 23 22:10:47 WoNas kernel: md: disk3 read error, sector=8589934608

There are read errors on multiple disks while trying to emulate disk10; run an extended SMART test on both.
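For reference, the extended test is typically started and read back with smartctl. A sketch below; the device name /dev/sdb is a placeholder, and since a real drive is needed to run the test, a canned excerpt stands in for the self-test log output:

```shell
# On a live system you would start the test and, hours later, read the log:
#   smartctl -t long /dev/sdb      # start extended (long) self-test
#   smartctl -l selftest /dev/sdb  # show the self-test result log
# A canned excerpt stands in for that output here:
cat <<'EOF' > /tmp/smart-selftest
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Extended offline    Completed without error       00%      43200
EOF

# A healthy drive shows "Completed without error" for the extended test
grep 'Extended offline' /tmp/smart-selftest
```

If the status instead reads something like "Completed: read failure", the drive itself is failing rather than the cabling or controller.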
WoRie Posted June 25, 2023 (Author)

Both disks completed the extended test without error. wonas-diagnostics-20230625-1128.zip
JorgeB Posted June 25, 2023

Replace cables for both disks and post new diags after array start.
WoRie Posted June 25, 2023 (Author)

Here you go. Started the array in maintenance mode with new cables fresh out of the box. wonas-diagnostics-20230625-1733.zip
WoRie Posted June 25, 2023 (Author)

Disk 5 had problems negotiating a link, even with the new cables. Now, after some up and down, the array is up. Disk 10 reports as being emulated, and I can only perform a read check, not a rebuild of disk 10... I think, if I'm able to restore the array in full, I should immediately move all files from these old disks (some from 2011) to the newer 18TB drives... Can I rebuild disk10, which currently appears empty in the array, or should I wipe it and re-add it? wonas-diagnostics-20230625-1811.zip
JorgeB Posted June 26, 2023

Still having issues with multiple disks, including one dropping offline; this could be a power problem.