September 1, 2024
12 minutes ago, prongATO said: First, man, your drives get toasty! The highest I've ever seen any drive in my system is 38°C.
Yes, bad cooling and a room temperature of 30°C 😅
September 2, 2024 (Author)
On 9/1/2024 at 1:17 AM, Vr2Io said:
Checking the different logs, there is no clear reason for the disk drops; they happen at random times. Once a disk drops from the array, you will get the lines below in the syslog. The disk reconnects within 3 seconds and its drive letter changes. This is the normal result of a disk drop.
Aug 24 11:42:49 MRPUNRFSX1 kernel: sd 1:0:1:0: device_block, handle(0x0009)
Aug 24 11:42:51 MRPUNRFSX1 kernel: sd 1:0:1:0: device_unblock and setting to running, handle(0x0009)
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Attached scsi generic sg8 type 0
Aug 24 11:42:55 MRPUNRFSX1 kernel: sd 1:0:5:0: Power-on or device reset occurred
Aug 24 11:42:57 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL ()' is not set to auto mount.
-----
Aug 23 13:57:35 MRPUNRFSX1 kernel: sd 3:0:4:0: device_block, handle(0x000d)
Aug 23 13:57:37 MRPUNRFSX1 kernel: sd 3:0:4:0: device_unblock and setting to running, handle(0x000d)
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Attached scsi generic sg11 type 0
Aug 23 13:57:41 MRPUNRFSX1 kernel: sd 3:0:5:0: Power-on or device reset occurred
Aug 23 13:57:43 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (dev1)' is not set to auto mount.
-----
Aug 23 16:57:20 MRPUNRFSX1 kernel: sd 4:0:3:0: device_block, handle(0x000c)
Aug 23 16:57:22 MRPUNRFSX1 kernel: sd 4:0:3:0: device_unblock and setting to running, handle(0x000c)
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Attached scsi generic sg10 type 0
Aug 23 16:57:29 MRPUNRFSX1 kernel: sd 4:0:5:0: Power-on or device reset occurred
Aug 23 16:57:30 MRPUNRFSX1 unassigned.devices: Disk with ID 'WDC_WD60EFPX-68C5ZN0_WD-WX52D53L37PL (sdu)' is not set to auto mount.
(1) Please make sure the dropped disk falls to UD (Unassigned Devices). If you can mount it there and read the file contents, you do not actually need to rebuild it. After the problem is fixed, re-add it to the array and rebuild parity.
(2) Stop disk auto spin-down.
(3) Open a terminal and type "dmesg -wT" to watch for any "I/O error", or "tail -f /var/log/syslog" to watch for any device_block and device_unblock. You can press "Enter" at any time to make new log entries after that point easier to spot.
(4) A parity operation should generate maximum I/O and be a useful way to stress a disk into dropping out, but in this case it seems not. So I suggest a different test in the hope that it reproduces the problem ASAP. I don't know what triggers the drop, so just try it. Please open 3 terminals and execute the commands below. Two concurrent hashes of big files should put enough load on the CPU; the third one just hashes small files.
find /mnt/disk5 -type f -size +8G -exec b3sum {} \;
Hashes files on disk5 that are larger than 8GB.
find /mnt/disk6 -type f -size +8G -exec b3sum {} \;
Hashes files on disk6 that are larger than 8GB.
find /mnt/user -type f -size -5k -print0 | sort -Rz | xargs -0 -n1 b3sum
Hashes files on the user share (in random order) that are smaller than 5k.

While the memtest was running I removed the 3rd 5-in-3 cage and thoroughly cleaned it out. I ordered a new fan to put in it and will likely do that for all 4 cages if it works well. After I reinstalled the 5-in-3 cage, it rebuilt disk 12 with no issue. I still have the server in a state where I can service the cages. On Wednesday I will stress test it to see if we can make it error again. Right now, the parity is valid and the array is green. Diagnostics from the latest boot attached (no disk errors so far).
mrpunrfsx1-diagnostics-20240902-1825.zip
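For anyone following along, the log check from step (3) of the quoted advice can be turned into a quick per-disk tally. This is only a minimal sketch, assuming the syslog lines look like the excerpts quoted in this thread; `summarize_drops` is a hypothetical helper name, not an Unraid tool.

```shell
#!/bin/bash
# Sketch: count disk-drop events per SCSI address in syslog-style input.
# Assumption: drop events look like the lines quoted in this thread, e.g.
#   "... kernel: sd 1:0:1:0: device_block, handle(0x0009)"
# summarize_drops is a hypothetical helper name.
summarize_drops() {
    # Keep only the "device_block" lines (the drop itself), not the
    # "device_unblock" recovery lines, then reduce each line to its
    # SCSI address and count how often each address appears.
    grep -E 'kernel: sd [0-9]+:[0-9]+:[0-9]+:[0-9]+: device_block,' \
        | sed -E 's/.*kernel: sd ([0-9:]+): device_block,.*/\1/' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}

# Usage (not run automatically):
#   summarize_drops < /var/log/syslog
```

Each output line is a count followed by a SCSI address. The first number of the address is the SCSI host, so drops that cluster on one host number would point at one controller or cable set rather than at a single drive.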
September 2, 2024 (Author)
@Vr2Io do you want me to start another parity check to see if we can force another error? I might be able to pull full diagnostics if one comes up.
September 2, 2024
2 minutes ago, prongATO said: @Vr2Io do you want me to start another parity check to see if we can force another error? I might be able to pull full diagnostics if one comes up.
Actually, it is fine to start another parity check. It is the only currently known method to trigger a disk drop, until you find a quicker way.
Edited September 2, 2024 by Vr2Io
September 3, 2024
5 minutes ago, prongATO said: Just noticed this on a reboot
I have never seen this type of error. It is likely HBA or motherboard related, because disks won't cause this. How many disks are detected when booting into Unraid?
Edited September 3, 2024 by Vr2Io
September 3, 2024 (Author)
5 minutes ago, Vr2Io said: Never found this type / form error, likely HBA or Mobo related, because disks won't cause this. How many disk detect when booting in Unraid ?
I have all 8 ports populated on the HBA card. There is also an HBA built into the motherboard. All ports are full at the moment. On a reboot the error is cleared.
Edited September 3, 2024 by prongATO
September 3, 2024
27 minutes ago, prongATO said: I have all 8 populated on the LBA card. There is also an LBA card built into the MB. All are full at the moment.
I just noticed you have an onboard HBA (2308) and an add-on HBA (2008). You need to check whether the disks drop on only one HBA or across both.
Edited September 3, 2024 by Vr2Io
September 3, 2024 (Author)
6 minutes ago, Vr2Io said: Just notice you have onboard (2008) and addon HBA (2308). Need check does disk drop only at onboard HBA.
I thought that was the issue at first. Overall, I've had most of the errors on the HBA card, which is why I already replaced it, but disk 16 was connected to the motherboard at the time it got its error.
September 3, 2024
Since you have more than 8 disks, you need two controllers, so you can't isolate one controller for testing. I am wondering whether a faulty controller could affect the good one, but I can't confirm this. During a normal boot, does the LSI BIOS show both controllers in the same list? The error message says the add-on (2008, 01:00.0) failed, but the onboard (2308, 02:00.0) is fine. What other add-on HBA do you have on hand?
Edited September 3, 2024 by Vr2Io
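To check whether the drops cluster on one controller, it can help to list which PCI address each disk hangs off and compare it against the 01:00.0 (add-on) and 02:00.0 (onboard) addresses mentioned in the post above. The following is a rough sketch only. It assumes the usual Linux sysfs layout, where the resolved path of `/sys/block/sdX` contains the controller's PCI address directly above a `hostN` directory; `controller_of_path` and `list_disk_controllers` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: map each sd* disk to the PCI address of its controller.
# Assumption: the resolved sysfs path looks roughly like
#   /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/.../block/sdb
# (SATA ports may have an extra "ataN" directory before "hostN").
# Both function names are hypothetical.

# Extract the PCI address that sits directly above hostN in a sysfs path.
# Prints nothing if the path does not match (for example, USB disks).
controller_of_path() {
    printf '%s\n' "$1" | sed -nE \
        's#.*/([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])(/ata[0-9]+)?/host[0-9]+/.*#\1#p'
}

# Print "<disk> <pci address>" for every sd* block device present.
list_disk_controllers() {
    local dev path
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue
        path=$(readlink -f "$dev")
        printf '%s %s\n' "${dev##*/}" "$(controller_of_path "$path")"
    done
}

# Usage (not run automatically):
#   list_disk_controllers
```

Because drive letters change after a drop, the mapping is only valid for the current boot state and is best captured right after an error appears.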
September 3, 2024 (Author)
24 minutes ago, Vr2Io said: Through you have more then 8 disks and you need two controller, can't isolate the controller to test. Because I thinking if a fault controller will affect another good one, but this I can't confirm. In normal booting, does LSI BIOS will show two controller in same list ? The error message said addon ( 2008, 01:00.0 ) fail, but onboard ( 2308, 02:00.0 ) fine.
Yes, there are two discrete controllers: the 9211 and the 2308. I'm flashing them with updated firmware now.
September 3, 2024
2 minutes ago, prongATO said: yes, there are two discrete controllers. The 9211 and 2308. I'm flashing them with updated firmware now.
It would be great if you crossflash the add-on 9211 (2008 IR) to (2008 IT) for the best combination.
Edited September 3, 2024 by Vr2Io
September 3, 2024 (Author)
3 minutes ago, Vr2Io said: Great if you crossflash addon 9211 ( 2008 IR ) to ( 2008 IT ) for best combination.
They're both flashed to IT mode, both are on the most stable firmware version, 20.00.07.00, and the BIOS has been updated.
Edited September 3, 2024 by prongATO
September 3, 2024 (Author)
Diagnostics from the latest boot, for reference.
mrpunrfsx1-diagnostics-20240902-2010.zip
September 3, 2024
As no problem was found after booting into Unraid, you may ignore the BIOS error and start the parity check. If you reboot again, does the BIOS always show the same error? What other add-on HBA do you have on hand?
September 3, 2024
27 minutes ago, prongATO said: They're both flashed to IT mode and both are the flashed to the most stable version 20.00.07.00 firmware and the bios has been updated
I don't have any SAS2008 in my system now, but yours shows up as a RAID bus controller, which looks like IR firmware. Anyway, this doesn't matter. Below is what my previous SAS2008 HBA looked like.
Edited September 3, 2024 by Vr2Io
September 3, 2024 (Author)
6 minutes ago, Vr2Io said: I haven't any SAS2008 in system, but your one show RAID bus controller, it look like IR firmware, anyway this doesn't matter. Below is my previous SAS2008 HBA look like.
September 3, 2024 (Author)
Fingers crossed: I started the parity check last night and it's 75% done with no errors. Surely this didn't all happen because of a dirty 5-in-3 cage?! Yeah, it was a bit dusty, and I took it COMPLETELY apart to clean it (complete teardown, clean and reassemble). Maybe the firmware and BIOS flashes for the HBAs helped also. Diagnostics attached for reference.
mrpunrfsx1-diagnostics-20240903-1457.zip
Edited September 3, 2024 by prongATO
September 4, 2024 (Author)
I'm VERY encouraged by today's results: the parity check finished in just over 13 hours with no errors. I removed the 5-in-3 cage that was connected to the most errors, completely cleaned the whole device out, checked all the solder joints, replaced the fan with a Noctua fan that can now run 24/7 to keep the drives cooler, and reassembled the unit.
Again, I'm not sure which change fixed it: some dust that had gotten into the SATA ports on the backplane of the 5-in-3, flashing both HBAs (onboard and PCIe) with the 20.00.07.00 firmware and updating the BIOS, or the work I did on the cages (I took cage 4 apart and cleaned it earlier too). Either way, everything seems to be running fine now. No more random disk errors in the bottom two 5-in-3 cages.
I'm going to do the fan modification on the other cages in the coming weeks, as well as cleaning all of them thoroughly. I'll give it a week or two, and if the system continues to run well, I'll close this as solved and reference cleaning the cages and updating the firmware on the cards as the solution. Thank you to everyone who has helped me with this process. Latest diagnostics attached for reference.
mrpunrfsx1-diagnostics-20240903-2114.zip
September 4, 2024 (Author)
Well, unfortunately I woke up to more errors today. Still disk 12.
mrpunrfsx1-diagnostics-20240904-0835.zip
September 4, 2024 (Author)
On 9/2/2024 at 7:18 PM, Vr2Io said: Just notice you have onboard (2308) and addon HBA (2008). Need check does disk drop at either HBA or cross.
Okay, taking your advice here. After the latest errors on disk 12, I moved its data connection from the HBA card to a SATA port on the motherboard. Performing the rebuild now. None of this makes any sense. For two days the disk was hooked up to the HBA card and successfully completed a full parity check with no problems. Then, apparently during the mover script, it had errors early this morning.
September 4, 2024
That's why I said you need to find a way to quickly trigger a disk drop. Otherwise you will just fall into an endless rebuild loop.
Edited September 4, 2024 by Vr2Io
September 4, 2024 (Author)
6 hours ago, Vr2Io said: That's why I said you need found a way to quickly trigger disk drop. Otherwise you just fall to endless rebuild loop.
I can't figure out what triggers the errors. I honestly thought that it might have been the backplane of the HDD cage being dirty. The server hadn't made it completely through a parity check in months, until the other day. When this problem started, a drive would report errors, I'd rebuild it, and it'd be fine for a few months. Then it became a few weeks, and now it's a few days. I went through my normal hardware troubleshooting and replaced the breakout cables, then the HBA card, then the processor, then the power supply and cables. Really the only thing left is the motherboard. I have a bid on a used Supermicro X10SRL-F and will see in a few days if I can get it.
Edited September 4, 2024 by prongATO
September 4, 2024
What about the results of the test I provided earlier? There has been no feedback. Currently we can't confirm whether the problem comes from hardware; it could be software related, so we need different tests or setups to verify.
Edited September 5, 2024 by Vr2Io