aterox Posted April 15, 2020 (edited)
Hi, I've recently had an issue where several of my disks randomly get disabled. At first I thought the disks were bad, but they don't have any SMART errors and work fine for up to several days before the problem comes back. The weird part to me is that it's not always the same disks. It seems to be focused on disks 1-3, which I think are the disks currently being written to the most, but it has also happened to disks 8 and 9.
I've replaced all the cables, the HP SAS expander, and the power supply, none of which has helped. I've also checked the RAM with memtest, which showed no errors. The drives pass SMART tests, and as far as I can tell the reports show no errors.
I've attached diagnostics from a few minutes after the most recent occurrence, and the missing SMART reports from the disks after rebooting. I currently have all the drives that have had the issue running an extended SMART test. I'm looking to see if you guys spot anything that could be causing it before I look at getting a replacement HBA or swapping the motherboard entirely, since the HBA is built in. Thanks for any help!
Attachments: captainmarvel-diagnostics-20200414-2203.zip, captainmarvel-smart-20200414-2216.zip, captainmarvel-smart-20200414-2217.zip
Dissones4U Posted April 15, 2020 (edited)
1 hour ago, aterox said: "I've also checked the RAM with memtest"
Quote: "Total Width: 72 bits, Data Width: 64 bits"
My understanding is that memtest won't find errors with Error Correction Code (ECC) memory. Multiple disks failing at once can be indicative of the controller, but I'd expect all disks on the controller to error, not just two. The diagnostics zip file doesn't show disks 1 and 3 in the SMART reports, which means they were not connected at the time of the zip download. Disk 8 is helium and I'm not sure how to interpret the data provided in its SMART report, but disk 9 clearly shows the faults you've described above. However, the disk 9 faults were a long time ago, based on the current power-on hours being over 15k (the faults happened at under 3k). Did you move the problematic disks 8 and 9 and then put 1 and 3 in their place? Looking at diagnostics.zip ==> system ==> vars.txt, disks 1 and 3 show 18 errors each, so my guess is that whatever they're plugged into is the problem.
aterox (Author) Posted April 15, 2020
No disks have been moved. The diagnostics zip doesn't show 1 and 3 because I pulled it right after the errors occurred, which usually takes the disks completely offline. It seems to happen most often on 1 and 3, and these errors have shown up within the past week or so. I guess the long-ago errors on 9 may be something else. I probably misremembered which disk it happened on.
My thoughts on the controller were the same: it should have all disks error at the same time. The disks are plugged into the same SAS expander, but so are all the other disks, and I just changed the expander out as well. I believe 1, 2, and 3 are all plugged into the same port. It does seem to happen most often on those disks, I assume because they're being used the most right now, since they're next on the high-water fill. I'm going to try a different port on the expander and see if that's it.
JorgeB Posted April 15, 2020
First thing to do is to update the HBA firmware:
Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.04.00)
All p20 firmware releases except latest (20.00.07.00) have known issues.
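The version check described in the reply above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the log line is the one quoted in the post, and the "latest" value is taken from the reply rather than from any vendor source.

```python
import re

# Kernel log line copied from the post above.
LOG_LINE = (
    "Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: "
    "LSISAS2008: FWVersion(20.00.04.00)"
)

# Latest P20 release according to the reply above (an assumption of this sketch).
LATEST_P20 = (20, 0, 7, 0)


def parse_fw_version(line):
    """Return the FWVersion(...) value as a tuple of ints, or None if absent."""
    match = re.search(r"FWVersion\((\d+(?:\.\d+)*)\)", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_outdated_p20(version, latest=LATEST_P20):
    """True when `version` is a P20 (major 20) release older than `latest`."""
    return version[0] == latest[0] and version < latest


version = parse_fw_version(LOG_LINE)
print(version, is_outdated_p20(version))
```

Comparing the versions as integer tuples avoids the pitfalls of comparing dotted strings directly.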
aterox (Author) Posted April 15, 2020 (edited)
2 hours ago, johnnie.black said: "First thing to do is to update the HBA firmware: Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.04.00) All p20 firmware releases except latest (20.00.07.00) have known issues."
I had thought that I used the most current firmware when I flashed to IT mode. I'm having trouble finding 20.00.07.00 firmware for the board. I built the system based on the serverbuilds.net Anniversary 1.0 build, with a Gigabyte GA-7PESH2. The motherboard has an onboard LSI 2008. I assume that is the base chip rather than the name the firmware is released under, since I can't find anything. I found firmware for the 9210-8i and 9211-8i. Would these work? Thanks again for the help.
JorgeB Posted April 15, 2020
19 minutes ago, aterox said: "Would these work?"
They should, either one.
aterox (Author) Posted April 16, 2020
I've updated the firmware. I'll post back with results. Thanks for all the help so far!
aterox (Author) Posted April 16, 2020
18 hours ago, johnnie.black said: "They should, either one."
Well, it happened again. Disk 8 this time, which is one of my newest drives. The drive didn't go completely offline, but it was still disabled. I've attached current diagnostics. Any thoughts, or should I try a new HBA? Thanks for all the help!
Attachment: captainmarvel-diagnostics-20200416-0215.zip
JorgeB Posted April 16, 2020
Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore:
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [crit] 10664#10664: ngx_slab_alloc() failed: no memory
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: shpool alloc failed
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: nchan: Out of shared memory while allocating message of size 6156. Increase nchan_max_reserved_memory.
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: *139405 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: MEMSTORE:00: can't create shared message for channel /disks
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [crit] 10664#10664: ngx_slab_alloc() failed: no memory
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: shpool alloc failed
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: nchan: Out of shared memory while allocating message of size 6156. Increase nchan_max_reserved_memory.
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: *139411 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
Apr 15 17:28:23 CaptainMarvel nginx: 2020/04/15 17:28:23 [error] 10664#10664: MEMSTORE:00: can't create shared message for channel /disks
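When a syslog is flooded like this, one way to look for any disk messages that were still written is to drop the lines from the noisy program before reading. The Python sketch below assumes the timestamp/host/program layout seen in the excerpt above; `drop_program` is a hypothetical helper, not part of Unraid. It cannot recover entries that were never logged, which is the situation described in the reply.

```python
import re

# Captures the program tag in lines shaped like
# "Apr 15 17:28:22 CaptainMarvel nginx: ..." (layout assumed from the excerpt).
SYSLOG_TAG = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+([^:\[\s]+)")


def drop_program(lines, program="nginx"):
    """Return the lines whose syslog program tag is not `program`."""
    kept = []
    for line in lines:
        match = SYSLOG_TAG.match(line)
        if match and match.group(1) == program:
            continue  # skip the noisy program's entries
        kept.append(line)
    return kept


# Two lines taken from this thread: one nginx entry, one kernel entry.
SAMPLE = [
    "Apr 15 17:28:22 CaptainMarvel nginx: 2020/04/15 17:28:22 [error] 10664#10664: shpool alloc failed",
    "Apr 14 07:20:15 CaptainMarvel kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.04.00)",
]

for line in drop_program(SAMPLE):
    print(line)
```

Lines that do not match the expected layout are kept, so unusual entries are not silently discarded.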
aterox (Author) Posted April 16, 2020 (edited)
36 minutes ago, johnnie.black said: "Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore"
I will try to capture it better next time; something must have happened while I was trying to download it. I checked the syslog before downloading the diagnostics. It was spammed with read errors, then write errors.
The GUI did have a log of UDMA CRC errors, which cleared after the disk was reassigned to the array. Thanks for the help so far.
aterox (Author) Posted April 17, 2020
On 4/16/2020 at 5:23 AM, johnnie.black said: "Can't see the disk error, syslog is spammed with Nginx errors and was not logging anymore"
It has happened again. I believe the syslog shouldn't have the nginx errors this time. Thanks!
Attachment: captainmarvel-diagnostics-20200417-0526.zip
JorgeB Posted April 17, 2020
This looks like a power/connection issue, and it's affecting multiple disks. If you have already replaced the cables and PSU, the next step would be trying a different HBA.
aterox (Author) Posted April 17, 2020
Thanks for the help. I will order one and give it a go. I figured it had to be something like that, but I wanted to check with people who have done this more than me before I went ahead and bought more new parts. Thanks again!
JorgeB Posted April 17, 2020
The HBA would be one of the last suspects for me; something power related would be my first. But if you replaced practically everything else...
aterox (Author) Posted April 17, 2020
I had originally thought it was the power supply, but that's been replaced with a new, more than sufficient unit. It's on a UPS as well, so it shouldn't be suffering from any brownouts or anything like that either. Hopefully the HBA will be it.
aterox (Author) Posted April 28, 2020 (edited)
Okay. I decided to go ahead and swap out the motherboard as well, since the HBA was integrated. I found another board with an integrated controller and finally got it installed after a long wait for shipping. The issues still persist; however, I did notice something that seems to be causing it:
Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache
Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
This happens after the disk is spun down, following some reads or writes. The disk is then taken offline, but that isn't detected, and the disk isn't disabled, until it's accessed again. I looked back through the previous diagnostics I've had for these issues, and all the failed drives show this error and are then disabled later when they're accessed again. I've managed to reproduce the error manually and have attached the diagnostics with these errors on disk 9.
I haven't been able to find much of anything online on this, other than things that I've already tried or swapped. I've got new RAM arriving today and will be trying that, even though the current RAM did pass memtest. Any thoughts? Thanks!
Attachment: captainmarvel-diagnostics-20200428-0658.zip
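The manual search through old diagnostics described above could be automated along these lines. This is a minimal Python sketch, assuming the kernel message format shown in the post; `find_sync_failures` is a hypothetical helper. For reference, in the Linux kernel's SCSI host byte codes 0x01 is DID_NO_CONNECT, which would be consistent with the device already being unreachable when the cache flush was attempted.

```python
import re

# Pattern built from the kernel message quoted in the post above.
SYNC_FAIL = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>sd[a-z]+)\] "
    r"Synchronize Cache\(10\) failed: Result: "
    r"hostbyte=(?P<host>0x[0-9a-fA-F]+) driverbyte=(?P<driver>0x[0-9a-fA-F]+)"
)


def find_sync_failures(lines):
    """Return (device, host:channel:target:lun, hostbyte) per failed flush."""
    hits = []
    for line in lines:
        match = SYNC_FAIL.search(line)
        if match:
            hits.append(
                (match.group("dev"), match.group("hctl"), int(match.group("host"), 16))
            )
    return hits


# The two log lines from the post above.
SAMPLE = [
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache",
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronize Cache(10) failed: "
    "Result: hostbyte=0x01 driverbyte=0x00",
]

print(find_sync_failures(SAMPLE))
```

Running the same function over the syslog from each diagnostics zip would show whether every disabled disk logged this failure first, which is the pattern reported in the post.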
JorgeB Posted April 28, 2020
I've seen some errors logged recently with LSI and spin downs/spin ups, but they appear harmless, not actual errors, and never disable a disk. You could try disabling spin down for a couple of days and see if it makes any difference.
aterox (Author) Posted April 28, 2020 (edited)
It concerns me that these issues suddenly popped up after two years of working perfectly. If this were a newly built server, it would have made sense to leave spin down disabled and call it a non-issue glitch with the hardware. I would think it wouldn't just suddenly break and persist through hardware changes, though?
JorgeB Posted April 28, 2020
The LSI spin up/spin down errors started only on the latest Unraid releases.
aterox (Author) Posted April 28, 2020
I just looked back. The first time this happened was the day after the last release. I can't believe it never crossed my mind to downgrade. That's usually the first thing I think of. Thanks for all your help!
aterox (Author) Posted April 28, 2020 (edited)
Well, I downgraded to 6.8.2, which was working previously, and the issue still persists. I guess I'll see if turning off spin down works. Hopefully this will be fixed in future releases.
JorgeB Posted April 28, 2020
As mentioned, I've seen these errors logged before, but never disks getting disabled because of them. Still worth trying.