Array errors when a VM starts

July 31, 2022

A couple of days ago I started to get errors on one of my drives. I thought nothing of it and carried on as normal, until I decided to simply try rebuilding that drive. During the rebuild, a second drive hit the same issue.

Something was clearly off, so I decided to rebuild again. It ran for many hours without error, until I started a VM and almost instantly began to get errors on all the drives.

I can replicate this problem consistently: I boot Unraid, the rebuild starts and runs for however long, until I start a VM, at which point errors appear on every drive.

I have tried Googling this and a couple of suggestions pointed to passthrough, but as far as I am aware, I am not passing through anything other than the Unraid mount itself (which I have also tried removing, with no luck). I have also changed the PCIe ACS override from Multi-function to Disabled.

As far as I am aware, I am not doing anything out of the ordinary, and this setup had been running for months without incident before the first errors. I won't rule out that something may be failing, but the fact that it is perfectly reproducible simply by starting a VM makes me think there's something else at play. Any advice would be greatly appreciated!


July 31, 2022

Author

[   97.777213] tun: Universal TUN/TAP device driver, 1.6
[   97.824344] mdcmd (36): check
[   97.824353] md: recovery thread: recon D1 D3 ...
[   98.034791] mpt3sas 0000:06:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM
[  201.059073] br0: port 2(vnet0) entered blocking state
[  201.059077] br0: port 2(vnet0) entered disabled state
[  201.059107] device vnet0 entered promiscuous mode
[  201.059189] br0: port 2(vnet0) entered blocking state
[  201.059190] br0: port 2(vnet0) entered forwarding state
[  221.602074] mpt2sas_cm0: SAS host is non-operational !!!!
[  222.627071] mpt2sas_cm0: SAS host is non-operational !!!!
[  223.650069] mpt2sas_cm0: SAS host is non-operational !!!!
[  224.674075] mpt2sas_cm0: SAS host is non-operational !!!!
[  225.699076] mpt2sas_cm0: SAS host is non-operational !!!!
[  226.722072] mpt2sas_cm0: SAS host is non-operational !!!!
[  226.722176] mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
[  226.727073] blk_update_request: I/O error, dev sdd, sector 41710208 op 0x0:(READ) flags 0x0 phys_seg 72 prio class 0
[  226.727080] md: disk0 read error, sector=41710144
[  226.727082] md: disk0 read error, sector=41710152
--- many of the same message ---
[  226.727147] md: disk0 read error, sector=41710704
[  226.727148] md: disk0 read error, sector=41710712
[  226.730101] sd 7:0:1:0: [sde] tag#2947 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
[  226.730105] sd 7:0:1:0: [sde] tag#2947 CDB: opcode=0x88 88 00 00 00 00 00 02 7c 72 80 00 00 02 40 00 00
[  226.730106] blk_update_request: I/O error, dev sde, sector 41710208 op 0x0:(READ) flags 0x0 phys_seg 72 prio class 0
[  226.730110] md: disk29 read error, sector=41710144
[  226.730111] md: disk29 read error, sector=41710152
--- many of the same message ---
[  226.730166] md: disk29 read error, sector=41710704
[  226.730167] md: disk29 read error, sector=41710712
[  226.730182] sd 7:0:3:0: [sdg] tag#2948 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=DRIVER_OK cmd_age=5s
[  226.730184] sd 7:0:3:0: [sdg] tag#2948 CDB: opcode=0x88 88 00 00 00 00 00 02 7c 72 80 00 00 02 40 00 00
[  226.730185] blk_update_request: I/O error, dev sdg, sector 41710208 op 0x0:(READ) flags 0x0 phys_seg 72 prio class 0
[  226.730187] md: disk4 read error, sector=41710144
[  226.730188] md: disk4 read error, sector=41710152

So it's clear that at around 201 seconds I am starting the VM (the vnet0 lines), and the first controller error follows at around 221 seconds.
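To put a number on that gap, here is a minimal Python sketch that pulls the timestamps out of dmesg-style lines. The sample lines are copied from the log above; the function names `parse_dmesg` and `first_time` are made up for this illustration, and the choice of the vnet0 "forwarding state" line as the marker for the VM start is an assumption.

```python
import re

# Matches dmesg lines such as "[  201.059190] br0: port 2(vnet0) ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def parse_dmesg(text):
    """Return a list of (seconds_since_boot, message) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events

def first_time(events, needle):
    """Timestamp of the first message containing `needle`, or None."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None

# Three lines taken verbatim from the log in this post.
SAMPLE = """\
[  201.059190] br0: port 2(vnet0) entered forwarding state
[  221.602074] mpt2sas_cm0: SAS host is non-operational !!!!
[  226.727073] blk_update_request: I/O error, dev sdd, sector 41710208 op 0x0:(READ) flags 0x0 phys_seg 72 prio class 0
"""

events = parse_dmesg(SAMPLE)
vm_start = first_time(events, "vnet0) entered forwarding state")  # assumed VM-start marker
hba_fault = first_time(events, "SAS host is non-operational")
io_error = first_time(events, "I/O error")

print(f"VM NIC up -> first HBA fault: {hba_fault - vm_start:.1f} s")
print(f"first HBA fault -> first I/O error: {io_error - hba_fault:.1f} s")
```

On these lines the controller faults roughly 20.5 seconds after the VM's network port comes up, and the disk read errors only start about 5 seconds after that, which suggests the disk errors follow the controller fault rather than precede it.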

Most of it is the same error repeated, so nothing too interesting, but one line jumped out:

[  227.026975] pci 0000:06:00.0: Removing from iommu group 14

IOMMU group 14 is indeed the disk controller. So why is this happening? How can I see whether this is tied to the VM in any way? I can't connect the dots.
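One way to check what else sits in that group is to read the IOMMU layout from sysfs. The sketch below assumes a Linux host that exposes `/sys/kernel/iommu_groups` (the directory is empty or absent when IOMMU is off); the helper names `iommu_groups` and `group_of` are made up for this example, and `0000:06:00.0` is the controller address from the log above. Note also that the "Removing from iommu group" line is stamped about 0.3 seconds after the driver reported the controller dead, so it may be a consequence of the fault rather than its cause.

```python
from pathlib import Path

def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled or not exposed by this kernel
    for group_dir in base.iterdir():
        dev_dir = group_dir / "devices"
        if group_dir.name.isdigit() and dev_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in dev_dir.iterdir())
    return groups

def group_of(pci_addr, groups):
    """Return (group_number, devices) for the group holding pci_addr, or None."""
    for number, devices in groups.items():
        if pci_addr in devices:
            return number, devices
    return None

if __name__ == "__main__":
    hit = group_of("0000:06:00.0", iommu_groups())  # HBA address from the log
    if hit is None:
        print("controller not found in any IOMMU group")
    else:
        number, devices = hit
        print(f"group {number}: {', '.join(devices)}")
```

If the controller turned out to share its group with a device assigned to the VM, that would be one possible link between the two; if it is alone in the group, passthrough grouping is less likely to be the explanation.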


July 31, 2022

Author

I have disabled IOMMU, so it's unrelated to that. However, in the logs I have also noticed:

Jul 31 12:44:10 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:11 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:12 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:13 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:14 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221103000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221103000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(0)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x0009), sas_addr(0x4433221101000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(2)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221104000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(7)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000c), sas_addr(0x4433221106000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(5)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221105000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000d), sas_addr(0x4433221105000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(6)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221107000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000e), sas_addr(0x4433221107000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(4)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: unexpected doorbell active!
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: sending diag reset !!
Jul 31 12:44:16 HomeServer kernel: mpt2sas_cm0: Invalid host diagnostic register value
Jul 31 12:44:16 HomeServer kernel: mpt2sas_cm0: System Register set:
Jul 31 12:44:16 HomeServer kernel: mpt2sas_cm0: diag reset: FAILED

I guess the best lead right now is that there's something up with the SAS controller?
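To see at a glance how many devices the controller dropped, the removal lines can be paired with their slot numbers. This is only a sketch: the regular expressions are written against the message format shown in this post, not taken from any driver documentation, and the function name `removed_devices` is made up for the example. The sample is a verbatim subset of the log above.

```python
import re

# Patterns written against the syslog lines quoted in this post.
REMOVE_RE = re.compile(r"removing handle\((0x[0-9a-f]+)\), sas_addr\((0x[0-9a-f]+)\)")
SLOT_RE = re.compile(r"slot\((\d+)\)")

def removed_devices(lines):
    """Pair each 'removing handle' line with the slot reported right after it."""
    removed = []
    pending = None
    for line in lines:
        m = REMOVE_RE.search(line)
        if m:
            pending = {"handle": m.group(1), "sas_addr": m.group(2), "slot": None}
            removed.append(pending)
            continue
        s = SLOT_RE.search(line)
        if s and pending is not None:
            pending["slot"] = int(s.group(1))
            pending = None
    return removed

SAMPLE = """\
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221103000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221103000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(0)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x0009), sas_addr(0x4433221101000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(2)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221104000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221104000000)
Jul 31 12:44:15 HomeServer kernel: mpt2sas_cm0: enclosure logical id(0x590b11c01210fd00), slot(7)
"""

for dev in removed_devices(SAMPLE.splitlines()):
    print(f"slot {dev['slot']}: handle {dev['handle']} sas_addr {dev['sas_addr']}")
```

In the full log above, six devices are removed within the same second, right after the controller is declared dead. That pattern fits the whole controller (or its link to the host) dropping out rather than the individual disks failing one by one.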


July 31, 2022

Your diagnostics would provide more information.


December 31, 2022

Author

Sorry to bring up such an old thread, but I never got this solved and recently I have had some time on my hands!

Basically I am still getting the same symptoms. In the original post I tied this to starting a VM, but it turns out that isn't the only trigger and it happens anyway. I have replaced the SAS card and that has not helped.

I have also upgraded to the latest Unraid as of today. To trigger the issue I simply change one disk from "unassigned" to an actual drive. When I do that, all the disks go unassigned and similar syslog messages appear.

I have attached the diagnostics you asked for before. Any help would be greatly appreciated! Failing drives are of course a possibility, though I just wouldn't expect this outcome.

homeserver-diagnostics-20221231-1721.zip


December 31, 2022

Author

So I've spent a good hour reseating and dusting all the connections, and now it seems okay. I need to run some pre-clears and a re-sync. If that all goes well, then I assume this issue was simply down to poor connections.
