fritzdis Posted January 18 (edited)

Overnight, my server (on 6.12.6) apparently encountered significant errors, to the point where multiple drives became unavailable. I suspect this may be related to the HBA card the drives were connected to. I was also running a preclear on a newly acquired (refurbished) drive connected to that HBA. Unfortunately, I am unable to collect diagnostics, even from the console directly on the server (it hangs indefinitely). Via the webGUI, it appears to get stuck on this command:

sed -ri 's/^(share(Comment|ReadList|WriteList)=")[^"]+/\1.../' '/sf-unraid-diagnostics-20240118-0618/shares/appdata.cfg' 2>/dev/null

This also makes the entire webGUI unresponsive. From the console, I attempted to capture the syslog with:

cp /var/log/syslog /boot/syslog.txt

While this did save something (see attached), it is quite incomplete; I believe it does not show the instigating event(s). Any suggestions on what to do next?

syslog-manual.txt

Edited January 18 by fritzdis (added Unraid version)
JorgeB Posted January 18

There may be other syslog files; look for syslog.1 and/or syslog.2 and post those as well if they exist.
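If the webGUI is wedged, the rotated logs can be grabbed from the console in one go. A minimal sketch; the helper name is mine, and on Unraid the two directories would be /var/log and /boot:

```shell
# Copy the active syslog plus any rotated copies (syslog.1, syslog.2, ...)
# to a destination that survives a reboot. Directories are parameters so
# the sketch is easy to adapt and test.
copy_syslogs() {
  src="$1"; dst="$2"
  for f in "$src"/syslog "$src"/syslog.[0-9]; do
    # an unmatched glob stays literal, so check existence before copying
    [ -e "$f" ] && cp "$f" "$dst/$(basename "$f").txt"
  done
  return 0
}

# On the server:
#   copy_syslogs /var/log /boot
```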
fritzdis Posted January 18 (Author)

8 minutes ago, JorgeB said: There may be other syslog files; look for syslog.1 and/or syslog.2 and post those as well if they exist.

They exist. However, I removed the boot drive to copy the first one over to Windows, and apparently it does not remount when reinserted, so I'm not sure how to actually get those additional logs.
trurl Posted January 18

Might as well reboot and post diagnostics. Then set up a syslog server.
fritzdis Posted January 18 (Author)

I was able to remount the USB drive. Here are the additional log files. The repeated nginx errors are from accidentally leaving the webGUI open overnight. I will reboot for diagnostics in a little while if there are no other suggestions.

syslog1.txt syslog2.txt
trurl Posted January 18

9 minutes ago, fritzdis said: able to remount the USB drive

How did you do this?
fritzdis Posted January 18 (Author)

1 minute ago, trurl said: How did you do this?

mount /dev/sdh1 /boot

(after determining the USB drive must be sdh)
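For anyone else in this spot, guessing the /dev/sdX letter can be avoided: the Unraid boot flash is labeled UNRAID, so the partition can be found by its filesystem label. A sketch, assuming lsblk's raw "NAME LABEL" output format; the helper name is mine:

```shell
# Print the device node whose filesystem label is UNRAID, given
# "NAME LABEL" pairs on stdin (as produced by: lsblk -rno NAME,LABEL).
find_flash() {
  awk '$2 == "UNRAID" { print "/dev/" $1 }'
}

# On the server:
#   dev=$(lsblk -rno NAME,LABEL | find_flash)
#   mount "$dev" /boot
```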
JorgeB Posted January 18

The problem affected multiple disks at the same time:

Jan 18 04:56:38 SF-unRAID kernel: md: disk1 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk2 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk3 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk4 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk5 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk6 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk7 write error, sector=2483043792

Most likely a power/connection issue, or the controller, but I'm not seeing anything pointing to it for now.
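To see at a glance how widespread the errors are, a saved syslog can be tallied per disk: simultaneous errors across many disks point to a shared component rather than one drive. A rough sketch; the log path in the example is just an illustration:

```shell
# Count md read/write error lines per disk in a saved syslog.
# Errors on one or two disks suggest a drive problem; errors on all
# disks at once suggest the controller, cable, enclosure, or power.
count_md_errors() {
  grep -Eo 'md: disk[0-9]+ (read|write) error' "$1" | sort | uniq -c
}

# Example: count_md_errors /boot/syslog.txt
```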
fritzdis Posted January 18 (Author)

Yeah, I figured that was probably the case. What's weird is that I ran a parity check a couple of days ago without issue, and I can't think of what activity would even have been going on last night to trigger things, other than the preclear on the unassigned device. In any case, I'll try to shut down cleanly (that isn't going well so far), remove the unassigned drive, and boot back up for diagnostics.
JorgeB Posted January 18

Disk 7 will be disabled; see if the emulated disk mounts. The rest should be fine.
fritzdis Posted January 18 (Author, edited)

8 minutes ago, JorgeB said: Disk 7 will be disabled; see if the emulated disk mounts. The rest should be fine.

I was not able to shut down cleanly. I removed the unassigned device and booted up. Here are the new diagnostics. All drives are present, with disk 7 disabled as you said. Also, since the shutdown was unclean, starting the array would trigger a parity check. However, since I suspect a hardware issue somewhere in the chain, I am hesitant to do much of anything without diagnosing the issue if possible. Unfortunately, I am not able to connect all the drives without the controller card. That card is connected to a KTN-STL3 external enclosure, which means there are multiple potential points of failure (card, cable, enclosure), so I'm really not sure what my best next step is.

sf-unraid-diagnostics-20240118-0829.zip

Edit: I would say the card itself may be the most likely issue because I replaced its heatsink. However, as I said, I did run a full parity check after that without issue, so it's less of a sure thing. But if replacing the card entirely seems like the best move, I'm open to that.

Edited January 18 by fritzdis (additional info)
JorgeB Posted January 18

I would suggest starting the array to see if the emulated disk 7 is mounting; if a read check starts, you can cancel it.
fritzdis Posted January 18 (Author)

5 minutes ago, JorgeB said: I would suggest starting the array to see if the emulated disk 7 is mounting; if a read check starts, you can cancel it.

Thanks. It seems to have mounted fine. Syslog from afterward is attached. I guess it won't hurt to run a non-correcting parity check. But since that didn't trigger the issue last time, I'm still worried about how I will assess the hardware situation.

sf-unraid-syslog-20240118-1607.zip
JorgeB Posted January 18

27 minutes ago, fritzdis said: I guess it won't hurt to run a non-correcting parity check.

You could, to see whether the hardware is stable before trying to rebuild. If you try to rebuild now and there are issues again, it can disable one or two more disks, and two would be a problem, though that is usually recoverable.
fritzdis Posted January 18 (Author)

Yeah, I'll give the check a go, and then I guess I'll try the rebuild. If that succeeds, it's possible there was something about the unassigned drive (Toshiba MG07ACA14TE) that was causing an issue in the external enclosure, so I'll leave it out of the system for a while.
fritzdis Posted June 17 (Author, marked as Solution)

Very belated update: I'm pretty sure it was the cable connecting to the external enclosure. I reseated it and ran the server for quite a while without the new drive, but eventually I gave the drive another go. No issues this time. I was able to run a preclear on the Toshiba to test it, build parity on it, and then rebuild one of the data drives to increase space. It has been running stable for about a week since the rebuild. I sometimes have to move the server, so hopefully I just need to remember to check that cable carefully each time.