Major errors encountered, diagnostics fail, not sure how to proceed



Overnight, my server (on 6.12.6) apparently encountered significant errors, to the point where multiple drives became unavailable.  I suspect this may be related to the HBA card to which the drives were connected.  Also, I was running a preclear on a newly acquired (refurbished) drive connected to that HBA.

 

Unfortunately, I am unable to collect diagnostics, even via the console directly on the server (it hangs indefinitely).  Via the webGUI, it appears to get stuck on this command:

sed -ri 's/^(share(Comment|ReadList|WriteList)=")[^"]+/\1.../' '/sf-unraid-diagnostics-20240118-0618/shares/appdata.cfg' 2>/dev/null

 

This also makes the entire webGUI unresponsive.
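For anyone wondering what that step does: it looks like the part of diagnostics that masks share comments and read/write lists before the zip is built. Here is a minimal sketch of the same substitution run against a throwaway file, so the effect is visible. The function name `mask_share_cfg` and the sample file contents are made up for illustration, and it prints the result instead of editing in place (no `-i`).

```shell
# Apply the same substitution diagnostics uses, but print the result
# instead of editing the file in place.
# mask_share_cfg is a hypothetical helper name, not part of Unraid.
mask_share_cfg() {
    # Replace the quoted value of shareComment / shareReadList /
    # shareWriteList with "..." when the value is non-empty.
    sed -r 's/^(share(Comment|ReadList|WriteList)=")[^"]+/\1.../' "$1"
}
```

Usage would be something like `mask_share_cfg /tmp/sample.cfg`. Empty values and other keys are left untouched, because the pattern needs at least one character after the opening quote.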

 

From the console, I attempted to capture the syslog with this command:

cp /var/log/syslog /boot/syslog.txt

 

While this did save something (see attached), it is quite incomplete.  I believe it does not show the instigating event(s).

 

Any suggestions on what to do next?

syslog-manual.txt

Edited by fritzdis
added unraid version
Link to comment
8 minutes ago, JorgeB said:

There may be other syslog files, look for syslog.1 and/or syslog.2 and post those also if they exist.

They exist.  However, I removed the boot drive to copy the first one over to Windows, and I guess it does not remount when reinserted.  So I'm not sure how to actually get those additional logs.
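In case it helps someone in the same spot, here is a rough sketch of grabbing the live syslog plus any rotated copies in one go, rather than one `cp` at a time. The function name `copy_syslogs` and the default destination `/boot/logs` are my assumptions; both directories can be passed explicitly.

```shell
# Copy the live syslog and any rotated copies (syslog.1, syslog.2, ...)
# from a log directory to a destination directory.
# copy_syslogs is a hypothetical helper; the defaults are assumptions.
# Usage: copy_syslogs [log_dir] [dest_dir]
copy_syslogs() {
    log_dir="${1:-/var/log}"
    dest_dir="${2:-/boot/logs}"
    mkdir -p "$dest_dir" || return 1
    copied=0
    for f in "$log_dir"/syslog "$log_dir"/syslog.[0-9]*; do
        [ -f "$f" ] || continue   # skip glob patterns that matched nothing
        cp "$f" "$dest_dir/$(basename "$f").txt" && copied=$((copied + 1))
    done
    echo "copied $copied file(s) to $dest_dir"
}
```

This only helps if the destination is really on the flash drive. If the flash is not mounted at `/boot` (as after pulling and reinserting it), anything copied there most likely lands in RAM and is lost on reboot, so it is worth checking with `mountpoint -q /boot` first.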

Link to comment

The problem affected multiple disks at the same time:

 

Jan 18 04:56:38 SF-unRAID kernel: md: disk1 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk2 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk3 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk4 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk5 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk6 read error, sector=2483043784
Jan 18 04:56:38 SF-unRAID kernel: md: disk7 write error, sector=2483043792

 

Most likely a power/connection issue, or the controller, but I'm not seeing anything that points to a specific cause for now.
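To spot this pattern quickly in a long syslog, something like the following can group the md errors by timestamp. It is only a sketch: the function name `md_errors_by_time` is made up, and it assumes the standard syslog line layout shown in the excerpt above (month, day, time, host, `kernel:`, `md:`, disk).

```shell
# Summarize md read/write errors per timestamp from a syslog file.
# md_errors_by_time is a hypothetical helper name.
# Usage: md_errors_by_time /path/to/syslog
md_errors_by_time() {
    awk '/kernel: md: disk[0-9]+ (read|write) error/ {
        ts = $1 " " $2 " " $3                      # e.g. "Jan 18 04:56:38"
        if (!(ts in count)) order[++n] = ts        # remember first-seen order
        count[ts]++
        # A disk is listed once per logged error, so repeats are possible.
        disks[ts] = disks[ts] (disks[ts] == "" ? "" : ",") $7
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%s  %d error(s): %s\n", order[i], count[order[i]], disks[order[i]]
    }' "$1"
}
```

Many different disks listed against the same second suggests something shared (power, cable, controller, enclosure) rather than a single failing drive.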

Link to comment

Yeah, I figured that was probably the case.  What's weird is I ran a parity check a couple days ago without issue, and I can't think of what activity would have even been going on last night to trigger things, other than the preclear on the unassigned device.

 

In any case, I'll try to shut down cleanly (that isn't going well so far), remove the unassigned drive, and boot back up for diagnostics.

Link to comment
8 minutes ago, JorgeB said:

Disk7 will be disabled, see if the emulated disk mounts, rest should be fine.

I was not able to shut down cleanly.  I removed the unassigned device and booted up.  Here are the new diagnostics.

 

All drives are present, with disk 7 disabled as you said.  Also, since shutdown was unclean, starting the array would trigger a parity check.  However, since I suspect a hardware issue somewhere in the chain, I am hesitant to do much of anything without diagnosing the issue if possible.

 

Unfortunately, I am not able to connect all the drives without the controller card.  Also, that card is connected to a KTN-STL3 external enclosure, which means there are multiple potential issues (card, cable, enclosure), so I'm really not sure what my best next step is.

sf-unraid-diagnostics-20240118-0829.zip

 

Edit: I would say the card itself may be the most likely issue because I replaced the heatsink.  However, as I said, I did run a full parity check after that without issue, so it's less of a sure thing.  But if replacing the card entirely seems like the best move, I'm open to that.

Edited by fritzdis
additional info
Link to comment
5 minutes ago, JorgeB said:

I would suggest starting the array to see if the emulated disk7 is mounting; if a read check starts, you can cancel it.

 

Thanks.  Seems to have mounted fine.  Syslog from after attached.

 

I guess it won't hurt to run a non-correcting parity check.  But since that didn't trigger the issue last time, I'm still worried about how I will assess the hardware situation.

sf-unraid-syslog-20240118-1607.zip

Link to comment
27 minutes ago, fritzdis said:

I guess it won't hurt to run a non-correcting parity check.

You could, to see if the hardware is stable before trying to rebuild.  If you try to rebuild now and there are issues again, it can disable one or two more disks, and two would be a problem, though usually it's recoverable.

Link to comment

Yeah, I'll give the check a go, and then I guess I'll try the rebuild.

 

If that succeeds, it's possible there was something about the unassigned drive (Toshiba MG07ACA14TE) that was causing an issue in the external enclosure, so I'll leave that out of the system for a while.

Link to comment
