xRadeon

AMD Epyc & LSI 9220-8i Unstable Drive Detection


Specs:

Unraid: 6.7.2

Motherboard/CPU: Supermicro M11SDV-8C-LN4F (AMD Epyc 3251 SoC)

RAM: 2x 16GB ECC (tested good via Memtest86+ 5.01)

BIOS, IPMI: 1.0a, 3.13

HBA: LSI 9220-8i (IT mode FW 20.00.07.00; no BIOS installed, I think. I may have erased it and not put it back on, but I'd have to check.)

Drives: 5x HGST_HDN721010ALE604 & 4x INTEL_SSDSC2KW512G8 (8x plugged into HBA, 1x plugged into Motherboard)

 

Issue:

Hello Everyone,

I have a very strange issue with a new motherboard I'm trying to use for my Unraid build and I was hoping I could get some help.

The issue I'm running into is that drives attached to my LSI HBA are often not readable or not detected. By this I mean they either do not appear in the web UI at all, or they are detected but the syslog shows they cannot be accessed, and they still do not appear in the web UI. The issue always occurs on a warm reboot: if I power on the system, boot into Unraid, and then reboot, the drives are not detected once it comes back up. If I reboot over and over, they are still not detected. If I shut down the system and power it back up, the drives are detected again. Sometimes, even when the drives are detected, the array will not start because the drives cannot be accessed. The single drive plugged into the motherboard works fine every time; I can always see it in the web UI.

 

I suspect this is a BIOS/board issue, but I want to rule out a driver or kernel issue with these Epyc 3000 CPUs in Unraid.

 

Troubleshooting:

I have tried changing various BIOS settings, with no effect on the issue: Legacy vs UEFI boot, Above 4G Access, IOMMU, Virtualization, Precision timing, consistent PCI device naming, etc. No combination of settings I tried had any impact at all.

 

I know the HBA card is good because it works flawlessly in a Supermicro A1SRi-2758F board I have. I can reboot that board as many times as I wish and the drives are always detected.

 

When the drives are not detected, I generally see this in the syslog:

Aug 24 22:05:01 Island kernel: mpt2sas_cm0: _base_wait_for_doorbell_not_used: failed due to timeout count(5000), doorbell_reg(ffffffff)!
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: Allocated physical memory: size(1687 kB)
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: Current Controller Queue Depth(3364),Max Controller Queue Depth(3432)
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: Scatter Gather Elements per IO(128)
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: doorbell is in use (line=5195)
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: _base_send_ioc_init: handshake failed (r=-14)
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: sending diag reset !!
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: diag reset: FAILED
Aug 24 22:05:01 Island kernel: mpt2sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:10651/_scsih_probe()!

 

Or

Aug 30 22:28:07 Island kernel: mpt2sas_cm0: sending diag reset !!
Aug 30 22:28:07 Island kernel: mpt2sas_cm0: diag reset: FAILED
Aug 30 22:28:07 Island kernel: scsi target1:0:4: target reset: FAILED scmd(00000000f7e73a1b)
Aug 30 22:28:07 Island kernel: mpt2sas_cm0: attempting host reset! scmd(00000000f7e73a1b)
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: tag#0 CDB: opcode=0x1a 1a 00 3f 00 04 00
Aug 30 22:28:07 Island kernel: mpt2sas_cm0: Blocking the host reset
Aug 30 22:28:07 Island kernel: mpt2sas_cm0: host reset: FAILED scmd(00000000f7e73a1b)
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: Device offlined - not ready after error recovery
Aug 30 22:28:07 Island kernel: scsi 1:0:5:0: Device offlined - not ready after error recovery
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: [sdg] Write Protect is off
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: [sdg] Mode Sense: 00 00 00 00
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: rejecting I/O to offline device
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: [sdg] Asking for cache data failed
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: [sdg] Assuming drive cache: write through
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: rejecting I/O to offline device
Aug 30 22:28:07 Island kernel: sd 1:0:4:0: [sdg] Attached SCSI disk
Aug 30 22:28:07 Island kernel: mpt2sas_cm0: _config_request: waiting for operational state(count=1)
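
To compare a good boot against a bad one without reading every line, the failure messages in the two excerpts above can be counted with a short script. This is only a minimal sketch: the pattern names and the function name count_hba_failures are made up for illustration, and the patterns are copied from the messages quoted above, so other kernel or driver versions may word them differently.

```python
import re
from collections import Counter

# Failure signatures taken from the syslog excerpts in this post.
# The dictionary keys are illustrative labels, not driver terminology.
FAILURE_PATTERNS = {
    "doorbell_timeout": re.compile(
        r"_base_wait_for_doorbell_not_used: failed due to timeout"),
    "handshake_failed": re.compile(r"_base_send_ioc_init: handshake failed"),
    "diag_reset_failed": re.compile(r"diag reset: FAILED"),
    "host_reset_failed": re.compile(r"host reset: FAILED"),
    "probe_failed": re.compile(r"failure at .*_scsih_probe\(\)"),
    "device_offlined": re.compile(
        r"Device offlined - not ready after error recovery"),
}


def count_hba_failures(lines):
    """Count how often each failure signature appears in kernel log lines."""
    counts = Counter()
    for line in lines:
        # Only kernel messages are of interest here.
        if "kernel:" not in line:
            continue
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Example use (the path is an assumption; point it at an extracted syslog):
#     with open("syslog.txt", errors="replace") as fh:
#         print(count_hba_failures(fh))
```

A syslog from a boot where the drives were detected should produce an empty counter, while one from a failed boot should show at least the reset failures.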

 

I opened a support case with Supermicro, but after a few back-and-forth emails they abruptly closed it, so I doubt they're going to be much help. They did say no one else has reported an issue like this, so I could just have a bad board.

 

I have thought about purchasing an LSI 9300-8i, but I don't want to dump 175 bucks if I still have the same issue. I may still buy it since these LSI SAS2008 cards are getting somewhat old and have stopped getting FW updates.

 

Logs:

I've attached multiple logs and diagnostics, here are some descriptions:

island-syslog-20190824-1009.zip: Drives not detected/usable.
island-syslog-20190822-0054.zip: Drives detected/usable.
island-diagnostics-20190822-0053.zip: Drives detected/usable.
island-diagnostics-20190830-1031.zip: Drives not detected/usable.
island-syslog-20190830-1031.zip: Drives not detected/usable.

 

Help me, Unraid forum, you're my only hope...

 

Thanks!

island-syslog-20190824-1009.zip island-syslog-20190822-0054.zip island-diagnostics-20190822-0053.zip island-diagnostics-20190830-1031.zip island-syslog-20190830-1031.zip


Try another PCI-e slot on the MB. My guess would be that the LSI card is not getting reset when you reboot. If you Google "lsi 9220 problems", you will find that this was a solution to basically the same problem you are having.


The board is a mini-ITX board, so I only have one PCIe slot. I'll do some digging on that Google search and see if just getting that 9300-8i would solve the problem. There's also a BIOS option for splitting the PCIe slot when using a riser card; I'm not sure whether I played with that option or not. I'll try it out and see what I find. Thanks for the pointer!


If worst comes to worst, you do have the option of powering down and restarting instead of rebooting. Of course, this will require direct access to the server rather than a remote reboot...
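
Since the specs in the first post list IPMI firmware, a cold power cycle might also be possible without physical access. The following is a minimal sketch, assuming ipmitool is installed on another machine, IPMI over LAN is enabled on the BMC, and the password is supplied through the IPMI_PASSWORD environment variable. The function names, host address, and user name are placeholders. Whether a BMC power cycle resets the HBA the same way a manual shutdown and power-up does has not been tested in this thread.

```python
import os
import subprocess


def build_power_cycle_cmd(bmc_host, user):
    """Build the ipmitool command line for a chassis power cycle.

    -I lanplus selects the IPMI v2.0 LAN interface; -E tells ipmitool to
    read the password from the IPMI_PASSWORD environment variable.
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,
        "-U", user,
        "-E",
        "chassis", "power", "cycle",
    ]


def power_cycle(bmc_host, user):
    """Run the power cycle; raises if the password is unset or ipmitool fails."""
    if "IPMI_PASSWORD" not in os.environ:
        raise RuntimeError("set IPMI_PASSWORD before calling power_cycle()")
    return subprocess.run(build_power_cycle_cmd(bmc_host, user), check=True)


# Example use (placeholder address and user, not values from this thread):
#     power_cycle("192.0.2.10", "ADMIN")
```

A power cycle cuts power without a clean OS shutdown, so the array should be stopped and Unraid shut down first if it is still responsive.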

Edited by Frank1940

