Disabled Disks. Possible HBA issue?



Hi,

 

I've recently had an issue where several of my disks randomly get disabled. At first I thought the disks were bad, but they don't have any SMART errors, and they work fine for up to several days before the problem comes back. The weird part to me is that it's not always the same disks. It seems to be focused on disks 1-3, which I think are the disks currently being written to the most, but it has also happened to disks 8 and 9.

 

I've replaced all the cables, the HP SAS expander, and the power supply, none of which has helped. I've also checked the RAM with memtest, which showed no errors. The drives pass SMART tests, and as far as I can tell the reports show no errors. I've attached diagnostics from a few minutes after the most recent occurrence, and the missing SMART reports from the disks after rebooting. I currently have all the drives that have had the issue running an extended SMART test.

 

I'm looking to see if you guys spot anything that could be causing it before I look at getting a replacement HBA or swapping the motherboard entirely since it's built in.

 

Thanks for any help!

captainmarvel-diagnostics-20200414-2203.zip captainmarvel-smart-20200414-2216.zip captainmarvel-smart-20200414-2217.zip

Edited by aterox
1 hour ago, aterox said:

I've also checked the RAM with memtest

Quote

Total Width: 72 bits
Data Width: 64 bits

My understanding is that memtest won't find errors with Error Correction Code (ECC) memory. Multiple disks failing at once can indicate a controller problem, but I'd expect all disks on the controller to error, not just two. The diagnostics zip doesn't include SMART reports for disks 1 and 3, which means they were not connected when the zip was downloaded. Disk 8 is helium and I'm not sure how to interpret its SMART data, but Disk 9 clearly shows the faults you've described above. However, the Disk 9 faults happened a long time ago: the current power-on hours are over 15k, and the faults occurred at under 3k. Did you move the problematic disks 8 and 9 and then put 1 and 3 in their place? Looking at diagnostics.zip ==> system ==> vars.txt, disks 1 and 3 show 18 errors each, so my guess is that whatever they're plugged into is the problem.
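For anyone wondering how the two quoted width lines point to ECC: a 72-bit total width over a 64-bit data width leaves 8 extra bits per word, which is what ECC modules use for check bits. Here is a minimal Python sketch of that comparison, assuming the text comes from `dmidecode` memory device output in the same format as the quote. The function name `has_ecc_bits` is made up for illustration, and pairing the fields by order is a simplification.

```python
import re

def has_ecc_bits(dmidecode_text: str) -> bool:
    """Return True if any memory device reports Total Width > Data Width.

    Assumes "Total Width" and "Data Width" appear in matching order,
    one pair per memory device, as in the quoted output.
    """
    totals = [int(n) for n in re.findall(r"Total Width:\s*(\d+)\s*bits", dmidecode_text)]
    datas = [int(n) for n in re.findall(r"Data Width:\s*(\d+)\s*bits", dmidecode_text)]
    return any(t > d for t, d in zip(totals, datas))

# The two lines quoted in the post above.
sample = "Total Width: 72 bits\nData Width: 64 bits\n"
print(has_ecc_bits(sample))  # True: 8 extra bits per 64-bit word
```

This only says the modules carry extra bits. It does not confirm that ECC is actually enabled on the board.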

Edited by Dissones4U

No disks have been moved. The diagnostics zip doesn't show 1 and 3 because I pulled it right after the errors occurred, which usually takes the disks completely offline. It seems to happen most often on 1 and 3, and these errors have shown up within the past week or so. I guess the old errors on 9 may be something else. I probably misremembered which disk it happened on.

 

My thoughts on the controller were the same: if it were at fault, all disks should error at the same time.

The disks are plugged into the same SAS expander, but so are all the other disks. I also just swapped the expander out.

I believe 1, 2, and 3 are all plugged into the same port. It does seem to happen most often on those disks, I assume because they're next on the high-water fill and so are being used the most right now.

 

I'm going to try a different port on the expander and see if that's it.

2 hours ago, johnnie.black said:

First thing to do is to update the HBA firmware:


Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.04.00)

 

All p20 firmware releases except latest (20.00.07.00) have known issues.
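The check described above can be scripted against a saved syslog. This is a rough Python sketch, assuming the `mpt2sas` driver prints `FWVersion(...)` in the format of the quoted line and taking the "only 20.00.07.00 is good" rule from the post at face value. The helper names `parse_fw_version` and `is_affected_p20` are hypothetical.

```python
import re

# 20.00.07.00, the p20 release the post above says is free of the known issues.
KNOWN_GOOD = (20, 0, 7, 0)

def parse_fw_version(line: str):
    """Extract the firmware version from a kernel log line as a tuple of ints."""
    m = re.search(r"FWVersion\((\d+)\.(\d+)\.(\d+)\.(\d+)\)", line)
    return tuple(int(g) for g in m.groups()) if m else None

def is_affected_p20(version) -> bool:
    """True for p20 releases older than KNOWN_GOOD."""
    return version is not None and version[0] == 20 and version < KNOWN_GOOD

line = "Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.04.00)"
fw = parse_fw_version(line)
print(fw, is_affected_p20(fw))  # (20, 0, 4, 0) True
```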

 

 

 

I thought I had used the most current firmware when I flashed to IT mode.

I'm having trouble finding 20.00.07.00 firmware for the board. I built the system based on the serverbuilds.net Anniversary 1.0 build, with a Gigabyte GA-7PESH2. The motherboard has an onboard LSI 2008, which I assume is the name of the base chip rather than the product the firmware is released for, since I can't find anything under that name. I found firmware for the 9210-8i and 9211-8i. Would these work?

 

Thanks again for the help.

Edited by aterox

Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore:

Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [crit] 10664#10664: ngx_slab_alloc() failed: no memory
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: shpool alloc failed
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: nchan: Out of shared memory while allocating message of size 6156. Increase nchan_max_reserved_memory.
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: *139405 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: MEMSTORE:00: can't create shared message for channel /disks
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [crit] 10664#10664: ngx_slab_alloc() failed: no memory
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: shpool alloc failed
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: nchan: Out of shared memory while allocating message of size 6156. Increase nchan_max_reserved_memory.
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: *139411 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: MEMSTORE:00: can't create shared message for channel /disks
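When a saved syslog is buried under nginx lines like these, one option is to filter them out before reading. This is a minimal Python sketch, assuming the standard syslog layout seen above where the process tag appears as ` nginx: `. The function name `drop_nginx_noise` is invented for the example, and the kernel line in the sample is borrowed from a later post in this thread. Filtering cannot bring back messages that were never written once logging stopped.

```python
def drop_nginx_noise(lines):
    """Keep only syslog lines that were not written by the nginx process."""
    return [ln for ln in lines if " nginx: " not in ln]

sample = [
    "Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: shpool alloc failed",
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache",
]

for ln in drop_nginx_noise(sample):
    print(ln)  # only the kernel line is printed
```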

 

36 minutes ago, johnnie.black said:

Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore.

 

I will try to capture it better next time. Something must have happened while I was trying to download it. I checked the syslog before downloading the diagnostics, and it was spammed with read errors, then write errors.

The GUI did have a log of UDMA CRC errors, which cleared after the disks were reassigned to the array.

Thanks for the help so far.

Edited by aterox
On 4/16/2020 at 5:23 AM, johnnie.black said:

Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore.

 

It has happened again. I believe the syslog shouldn't have the nginx errors this time.

Thanks!

captainmarvel-diagnostics-20200417-0526.zip

  • 2 weeks later...

Okay. I decided to go ahead and swap out the motherboard as well, since the HBA was integrated. I found another board with an integrated controller and finally got it installed after a long wait for shipping.

The issues still persist; however, I did notice something that seems to be triggering them.

Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache
Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

This happens after the disk is spun down, after some reads or writes. The disk is then taken offline, but isn't detected and disabled until it's accessed again. I looked back through the previous diagnostics I've had for these issues and all the failed drives have this error and then are disabled later when they're accessed again.
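To check older diagnostics for the same pattern, a saved syslog can be scanned for the failure line. This is a small Python sketch, assuming the message format of the two kernel lines above. The host byte names come from the Linux kernel's SCSI host byte codes, where `0x01` is `DID_NO_CONNECT`, meaning the host could not reach the device. Only a few codes are listed here, and the function name `find_sync_cache_failures` is hypothetical.

```python
import re

# Partial table of Linux SCSI host byte codes; unlisted codes are shown as hex.
HOST_BYTES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
}

PATTERN = re.compile(
    r"\[(sd[a-z]+)\] Synchronize Cache\(10\) failed: Result: hostbyte=(0x[0-9a-fA-F]+)"
)

def find_sync_cache_failures(lines):
    """Return (device, host byte name) for each failed cache sync in the log."""
    hits = []
    for ln in lines:
        m = PATTERN.search(ln)
        if m:
            code = int(m.group(2), 16)
            hits.append((m.group(1), HOST_BYTES.get(code, hex(code))))
    return hits

log = [
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache",
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00",
]
print(find_sync_cache_failures(log))  # [('sdg', 'DID_NO_CONNECT')]
```

Mapping `sdg` back to an array slot still has to be done by hand from the diagnostics, since device letters can change between boots.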

 

I've managed to reproduce the error manually and have attached the diagnostics with these errors on disk 9.

I haven't been able to find much of anything online on this, other than things that I've already tried or swapped. I've got new RAM arriving today, and will be trying that even though the current RAM did pass memtest.

Any thoughts?

 

Thanks!

captainmarvel-diagnostics-20200428-0658.zip

Edited by aterox

It concerns me that these issues suddenly popped up after two years of working perfectly. If this were a newly built server, it would have made sense to leave spin down disabled and call it a harmless hardware quirk. But I wouldn't expect something to suddenly break and then persist through all these hardware changes.

Edited by aterox