fritzdis Posted January 18 (edited)

Overnight, my server (on 6.12.6) apparently encountered significant errors, to the point where multiple drives became unavailable. I suspect this may be related to the HBA card the drives were connected to. I was also running a preclear on a newly acquired (refurbished) drive connected to that HBA. Unfortunately, I am unable to collect diagnostics, even from the console directly on the server (it hangs indefinitely). Via the webGUI, it appears to get stuck on this command:

sed -ri 's/^(share(Comment|ReadList|WriteList)=")[^"]+/\1.../' '/sf-unraid-diagnostics-20240118-0618/shares/appdata.cfg' 2>/dev/null

This also makes the entire webGUI unresponsive. From the console, I attempted to capture the syslog with:

cp /var/log/syslog /boot/syslog.txt

While this did save something (see attached), it is quite incomplete; I believe it does not show the instigating event(s). Any suggestions on what to do next?

syslog-manual.txt

Edited January 18 by fritzdis (added Unraid version)
JorgeB Posted January 18

There may be other syslog files; look for syslog.1 and/or syslog.2 and post those as well if they exist.
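If the webGUI is wedged, the rotated logs can be grabbed from the console in one go. A minimal sketch; the helper name is mine, and on Unraid the two directories would be /var/log and /boot:

```shell
# Copy the active syslog plus any rotated copies (syslog.1, syslog.2, ...)
# to a destination that survives a reboot. Directories are parameters so
# the sketch is easy to adapt and test.
copy_syslogs() {
  src="$1"; dst="$2"
  for f in "$src"/syslog "$src"/syslog.[0-9]; do
    # an unmatched glob stays literal, so check existence before copying
    [ -e "$f" ] && cp "$f" "$dst/$(basename "$f").txt"
  done
  return 0
}

# On the server:
#   copy_syslogs /var/log /boot
```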
fritzdis Posted January 18 (Author)

8 minutes ago, JorgeB said: There may be other syslog files; look for syslog.1 and/or syslog.2 and post those as well if they exist.

They exist. However, I removed the boot drive to copy the first one over to Windows, and apparently it does not remount when reinserted, so I'm not sure how to actually get those additional logs.
trurl Posted January 18

Might as well reboot and post diagnostics. Then set up a syslog server.
fritzdis Posted January 18 (Author)

I was able to remount the USB drive. Here are the additional log files. The repeated nginx errors are from accidentally leaving the webGUI open overnight. I will reboot for diagnostics in a little while if there are no other suggestions.

syslog1.txt syslog2.txt
trurl Posted January 18

9 minutes ago, fritzdis said: able to remount the USB drive

How did you do this?
fritzdis Posted January 18 (Author)

1 minute ago, trurl said: How did you do this?

mount /dev/sdh1 /boot

(after determining the USB drive must be sdh)
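For anyone else in this spot, guessing the /dev/sdX letter can be avoided: the Unraid boot flash is labeled UNRAID, so the partition can be found by its filesystem label. A sketch, assuming lsblk's raw "NAME LABEL" output format; the helper name is mine:

```shell
# Print the device node whose filesystem label is UNRAID, given
# "NAME LABEL" pairs on stdin (as produced by: lsblk -rno NAME,LABEL).
find_flash() {
  awk '$2 == "UNRAID" { print "/dev/" $1 }'
}

# On the server:
#   dev=$(lsblk -rno NAME,LABEL | find_flash)
#   mount "$dev" /boot
```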
JorgeB Posted January 18

The problem affected multiple disks at the same time:

Jan 18 04:56:38 SF-unRAID kernel: md: disk1 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk2 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk3 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk4 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk5 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk6 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk7 write error, sector=2483043792

Most likely a power/connection issue, or the controller, but I'm not seeing anything pointing to it for now.
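To see at a glance how widespread the errors are, a saved syslog can be tallied per disk: simultaneous errors across many disks point to a shared component rather than one drive. A rough sketch; the log path in the example is just an illustration:

```shell
# Count md read/write error lines per disk in a saved syslog.
# Errors on one or two disks suggest a drive problem; errors on all
# disks at once suggest the controller, cable, enclosure, or power.
count_md_errors() {
  grep -Eo 'md: disk[0-9]+ (read|write) error' "$1" | sort | uniq -c
}

# Example: count_md_errors /boot/syslog.txt
```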
fritzdis Posted January 18 (Author)

Yeah, I figured that was probably the case. What's weird is that I ran a parity check a couple of days ago without issue, and I can't think of what activity would even have been going on last night to trigger things, other than the preclear on the unassigned device. In any case, I'll try to shut down cleanly (that isn't going well so far), remove the unassigned drive, and boot back up for diagnostics.
JorgeB Posted January 18

Disk 7 will be disabled; see if the emulated disk mounts. The rest should be fine.
fritzdis Posted January 18 (Author, edited)

8 minutes ago, JorgeB said: Disk 7 will be disabled; see if the emulated disk mounts. The rest should be fine.

I was not able to shut down cleanly. I removed the unassigned device and booted up. Here are the new diagnostics. All drives are present, with disk 7 disabled as you said. Also, since the shutdown was unclean, starting the array would trigger a parity check. However, since I suspect a hardware issue somewhere in the chain, I am hesitant to do much of anything without diagnosing the issue if possible. Unfortunately, I am not able to connect all the drives without the controller card. That card is connected to a KTN-STL3 external enclosure, which means there are multiple potential points of failure (card, cable, enclosure), so I'm really not sure what my best next step is.

sf-unraid-diagnostics-20240118-0829.zip

Edit: I would say the card itself may be the most likely issue because I replaced its heatsink. However, as I said, I did run a full parity check after that without issue, so it's less of a sure thing. But if replacing the card entirely seems like the best move, I'm open to that.

Edited January 18 by fritzdis (additional info)
JorgeB Posted January 18

I would suggest starting the array to see if the emulated disk 7 is mounting; if a read check starts, you can cancel it.
fritzdis Posted January 18 (Author)

5 minutes ago, JorgeB said: I would suggest starting the array to see if the emulated disk 7 is mounting; if a read check starts, you can cancel it.

Thanks. It seems to have mounted fine. Syslog from afterward is attached. I guess it won't hurt to run a non-correcting parity check. But since that didn't trigger the issue last time, I'm still worried about how I will assess the hardware situation.

sf-unraid-syslog-20240118-1607.zip
JorgeB Posted January 18

27 minutes ago, fritzdis said: I guess it won't hurt to run a non-correcting parity check.

You could, to see whether the hardware is stable before trying to rebuild. If you try to rebuild now and there are issues again, it can disable one or two more disks, and two would be a problem, though that is usually recoverable.
fritzdis Posted January 18 (Author)

Yeah, I'll give the check a go, and then I guess I'll try the rebuild. If that succeeds, it's possible there was something about the unassigned drive (Toshiba MG07ACA14TE) that was causing an issue in the external enclosure, so I'll leave it out of the system for a while.
fritzdis Posted June 17 (Author, marked as Solution)

Very belated update: I'm pretty sure it was the cable connecting to the external enclosure. I reseated it and ran the server for quite a while without the new drive, but eventually I gave the drive another go. No issues this time. I was able to run a preclear on the Toshiba to test it, build parity on it, and then rebuild one of the data drives to increase space. It has been running stable for about a week since the rebuild. I sometimes have to move the server, so hopefully I just need to remember to check that cable carefully each time.