aterox


  1. Well, I downgraded to 6.8.2, which was working previously, and the issue still persists. I guess I'll see if turning off spin down works. Hopefully this will be fixed in future releases.
  2. I just looked back. The first time this happened was the day after the last release. I can't believe it never crossed my mind to downgrade. That's usually the first thing I think of. Thanks for all your help!
  3. It concerns me that these issues suddenly popped up after two years of working perfectly. If this were a newly built server, it would have made sense to leave spin down disabled and call it a harmless glitch with the hardware. But I wouldn't expect it to suddenly break and then persist through hardware changes.
  4. Okay. I decided to go ahead and swap out the motherboard as well, since the HBA was integrated. I found another board with an integrated controller and finally got it installed after a long wait for shipping. The issues still persist. However, I did notice something that seems to be causing it:

     Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache
     Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00

     This happens after the disk is spun down, following some reads or writes. The disk is then taken offline, but it isn't detected and disabled until it's accessed again. I looked back through the previous diagnostics I've collected for these issues, and all the failed drives show this error and are then disabled later when they're accessed again. I've managed to reproduce the error manually and have attached the diagnostics with these errors on disk 9. I haven't been able to find much of anything online about this, other than things that I've already tried or swapped. I've got new RAM arriving today and will try that, even though the current RAM did pass memtest. Any thoughts? Thanks!

     captainmarvel-diagnostics-20200428-0658.zip
  5. I had originally thought it was the power supply, but that's been replaced with a new, more than sufficient part. It's also on a UPS, so it shouldn't be suffering from any brownouts or anything like that either. Hopefully the HBA will be it.
  6. Thanks for the help. I will order one and give it a go. I figured it had to be something like that, but I wanted to check with people who have done this more than me before I went ahead and bought more new parts. Thanks again!
  7. It has happened again. I believe the syslog shouldn't have the nginx errors this time. Thanks! captainmarvel-diagnostics-20200417-0526.zip
  8. I will try to capture it better next time; something must have happened while I was trying to download it. I checked the syslog before downloading the diagnostics. It was spammed with read errors, then write errors. The GUI did have a log of UDMA CRC errors, which then cleared after the disk was reassigned to the array. Thanks for the help so far.
  9. Well, it happened again. Disk 8 this time, which is one of my newest drives. The drive didn't go completely offline, but it was still disabled. I've attached current diagnostics. Any thoughts, or should I try a new HBA? Thanks for all the help! captainmarvel-diagnostics-20200416-0215.zip
  10. I've updated the firmware. I'll post back with results. Thanks for all the help so far!
  11. I had thought that I had used the most current firmware when I flashed to IT mode. I'm having trouble finding 20.00.07.00 firmware for the board. I built the system based on the serverbuilds.net Anniversary 1.0 build, with a Gigabyte GA-7PESH2. The motherboard has an onboard LSI 2008, which I assume is the base chip rather than the name the firmware is published under, since I can't find anything for it. I found firmware for the 9210-8i and 9211-8i. Would these work? Thanks again for the help.
  12. No disks have been moved. The diagnostics zip doesn't show 1 and 3, because I pulled it right after the errors occurred, which usually takes the disks completely offline. It seems to happen most often on 1 and 3, and these errors have shown up within the past week or so. I guess the long-ago errors on 9 may be something else; I probably misremembered which disk it happened on. My thoughts on the controller were the same: a bad controller should make all disks error at the same time. The disks are plugged into the same SAS expander, but so are all the other disks, and I just changed the expander out as well. I believe 1, 2, and 3 are all plugged into the same port. It does seem to happen most often on those disks, I assume because they're being used the most right now, since they're next on the high-water fill. I'm going to try a different port on the expander and see if that's it.
  13. Hi, I've recently had an issue where several of my disks randomly disable themselves. At first I thought the disks were bad, but they don't have any SMART errors, and they work fine for up to several days before the problem comes back. The weird part to me is that it's not always the same disks. It seems to be focused on disks 1-3, which I think are the disks currently being written to the most, but it's also happened to disks 8 and 9. I've replaced all the cables, the HP SAS expander, and the power supply, none of which has helped. I've also checked the RAM with memtest, which showed no errors. The drives pass SMART tests, and as far as I can tell the reports indicate no errors. I've attached diagnostics from a few minutes after the most recent occurrence, and the missing SMART reports from the disks after rebooting. I currently have all the drives that have had the issue running an extended SMART test. I'm hoping you guys can spot anything that could be causing it before I look at getting a replacement HBA or swapping the motherboard entirely, since the controller is built in. Thanks for any help! captainmarvel-diagnostics-20200414-2203.zip captainmarvel-smart-20200414-2216.zip captainmarvel-smart-20200414-2217.zip
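
For anyone comparing several diagnostics zips for the same pattern, the "Synchronize Cache(10) failed" message quoted in the thread can be pulled out of a syslog with a short script. This is only a rough sketch, not an Unraid tool: the function names are made up for illustration, the regular expression is assumed from the two log lines quoted above, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern assumed from the log lines quoted in the thread; other kernels
# or drivers may format this message differently.
SYNC_FAIL = re.compile(
    r"kernel: sd (?P<addr>\d+:\d+:\d+:\d+): \[(?P<dev>sd[a-z]+)\] "
    r"Synchronize Cache\(10\) failed: Result: "
    r"hostbyte=(?P<host>0x[0-9a-fA-F]+) driverbyte=(?P<driver>0x[0-9a-fA-F]+)"
)


def find_sync_cache_failures(lines):
    """Return one dict (addr, dev, host, driver) per failed cache sync."""
    failures = []
    for line in lines:
        match = SYNC_FAIL.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


def count_by_device(failures):
    """Count failures per device name (e.g. 'sdg')."""
    return Counter(f["dev"] for f in failures)


# The two syslog lines quoted in the thread, used as sample input.
sample = [
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] Synchronizing SCSI cache",
    "Apr 28 06:52:53 CaptainMarvel kernel: sd 8:0:3:0: [sdg] "
    "Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00",
]

failures = find_sync_cache_failures(sample)
for dev, count in count_by_device(failures).items():
    print(dev, count)
```

To run it against a real log, pass the lines of the syslog file extracted from a diagnostics zip instead of `sample`. Device names such as sdg can change between boots, so the counts are best compared within a single syslog rather than across reboots.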