sonofdbn

Disk in error state - continue parity check?

15 posts in this topic


This morning I found my server unresponsive: it seemed to be on, but I couldn't SSH in and it didn't respond to pings. So I switched it off and restarted it, and it automatically went into a parity check. A few minutes later I got a message for Disk 3 (UDMA CRC error count), then a message saying the disk was in an error state, and a warning that the array had errors: 1 disk with read errors.

 

I'm assuming I should replace the disk. Is that correct? (Diagnostics attached.) Also, should I stop the parity check? Finally, if I replace this 4TB disk, can I replace it with an 8TB disk (which is the size of my parity disks)? For the replacement, do I just put the new disk in place of the old one and rebuild?

tower-diagnostics-20200325-1526.zip


Disk3 dropped offline, most likely because of a power/connection problem, but since it dropped there's no SMART report. You should cancel the parity check, check/replace the cables and post new diags so we can check SMART.


Thanks for the quick response. OK, I cancelled the parity check and checked the cables, which seemed to be OK. I restarted but the disk still has an "x" next to it.

 

(Weirdly, I got a notification saying the array has turned good and has 0 disks with read errors. I get that the array still works with the disk disabled, but it doesn't seem like that state should be termed "good".)

 

If I need to restart the parity check, should I be writing corrections to parity?

 

Diagnostics are attached.

tower-diagnostics-20200325-1911.zip

8 minutes ago, sonofdbn said:

I restarted but the disk still has an "x" next to it.

That's expected, once a disk is disabled it needs to be rebuilt.

 

The disk looks fine. The high number of CRC errors suggests a SATA cable problem; if you don't want to replace it, at least swap with another disk to rule it out in case the disk gets disabled again. Then, since the emulated disk is mounting correctly, you can rebuild on top.

Just now, johnnie.black said:

at least swap with another disk

Just a warning that this can be a little risky for the rebuild, if there are errors on another disk and you let it finish.


So should I do another parity check first? Then if that turns out OK, do a rebuild?


Thanks for all the help. All done and seems to be OK. The disk was connected via a forward breakout cable to a SAS HBA and I was a little worried, having no spare breakout cable. Fortunately I found the last remaining SATA port on the MB and managed to use that with a new SATA cable. I'll watch out for more CRC errors, though.


Unfortunately the server crashed again. I noticed it when my Win10 VM disconnected (I wasn't using the VM, but it was running in an RDP window on my PC). I couldn't SSH in to the server and couldn't ping it. So I've rebooted and a parity check started automatically.

 

I've attached the latest diagnostics (after rebooting) and the previous one for easy reference. From what I can tell, on Disk 3 the UDMA CRC Error Count hasn't changed. I'm wondering whether there's anything that can tell what caused the crash. So far, 4.8% into the parity check, there are no errors.

tower-diagnostics-20200331-1439.zip tower-diagnostics-20200325-1911.zip
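
For comparing the CRC count between two diagnostics snapshots, here is a minimal sketch that pulls attribute 199 out of a SMART attribute table. It assumes the report uses the usual `smartctl -A` table layout. The sample text and the function names `crc_error_count` and `crc_delta` are made up for illustration; the values are not taken from the attached diagnostics.

```python
import re

# Illustrative sample rows in the usual `smartctl -A` table layout.
# The values are invented, NOT taken from the attached diagnostics.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18500
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1542
"""
# Same disk "later": only the power-on hours changed in this made-up sample.
SAMPLE_AFTER = SAMPLE_BEFORE.replace("18500", "18640")


def crc_error_count(smart_text):
    """Return the raw value of attribute 199, or None if it is absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is the 10th field.
        if len(fields) >= 10 and fields[0] == "199":
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def crc_delta(before_text, after_text):
    """Difference in the CRC raw value between two snapshots (None if unknown)."""
    before = crc_error_count(before_text)
    after = crc_error_count(after_text)
    if before is None or after is None:
        return None
    return after - before


if __name__ == "__main__":
    print("CRC errors before:", crc_error_count(SAMPLE_BEFORE))
    print("New CRC errors since then:", crc_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

The raw value is a running total that does not reset, so a delta of zero means no new CRC errors were logged between the two snapshots.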


Fortunately I ran the syslog server tool as suggested - the server crashed again. This time I could ping, but no SSH. Also I couldn't see any of the shares, but my Win10 VM was still running. Weird? The GUI timed out with a 500 Internal Server Error. So I shut down the VM and then rebooted. 

 

I've attached the syslog, with the entries at the end that came after the reboot removed. My UPS is down, so those messages are not surprising. 192.168.134 is an Asus router. I have two Asus routers: one is the main one and the other is configured as an access point (and also serves as a switch). I believe that internal IP address is the access point. The single Ethernet cable from the server is connected to the access point. unRAID came up again at 00:45.

extract-syslog-192.168.1.14.log tower-diagnostics-20200401-0104.zip


There's a call trace in the syslog but I can't say what it's about. I would suggest running the server for a while in safe mode without docker/VMs. If it still crashes like that, it's likely a hardware problem; if it doesn't, start re-enabling plugins and other services a few at a time to see if you can find the culprit.
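
For locating call traces in a saved syslog, a rough sketch along these lines may help. It assumes the kernel marks each trace with the text "Call Trace", which is the usual marker. The function name `find_call_traces` and the sample lines are invented for illustration and are not from the attached syslog.

```python
def find_call_traces(log_lines, context=3):
    """Return (line_number, excerpt) pairs for each 'Call Trace' marker found."""
    hits = []
    for index, line in enumerate(log_lines):
        if "Call Trace" in line:
            # Keep a few lines on either side so the trace can be read in context.
            start = max(0, index - context)
            end = min(len(log_lines), index + context + 1)
            hits.append((index + 1, log_lines[start:end]))
    return hits


# Invented sample lines, NOT copied from the attached syslog.
SAMPLE_LOG = [
    "Apr  1 00:10:01 Tower kernel: example message before the trace",
    "Apr  1 00:10:02 Tower kernel: Call Trace:",
    "Apr  1 00:10:02 Tower kernel:  example_function+0x10/0x20",
    "Apr  1 00:10:03 Tower kernel: example message after the trace",
]

if __name__ == "__main__":
    for line_number, excerpt in find_call_traces(SAMPLE_LOG, context=1):
        print(f"Call trace at line {line_number}:")
        for entry in excerpt:
            print("   ", entry)
```

To run it against a real file, read the log into a list of lines first (for example with `open(path).read().splitlines()`, where `path` is wherever the syslog was saved) and pass that list in place of `SAMPLE_LOG`. The lines just before each hit are usually the ones worth posting.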


Should I let the parity check finish? (The crash came roughly 50% of the way into the previous parity check.) Is there a possibility that the flash drive is corrupted?

 

I did install a docker recently, about the time of the first crash. I haven't started it this time round. Don't want to name it and give it a bad rep when it might have nothing to do with the crashes.
