sonofdbn Posted March 25, 2020
This morning I found my server unresponsive: it seemed to be on, but I couldn't SSH in and it didn't respond to pings. So I switched it off and restarted, and it automatically went into a parity check. A few minutes later I got a message for Disk 3 (udma crc error count), then a message saying it was in an error state, and a warning that the array had errors: 1 disk with read errors. I'm assuming I should replace the disk. Is that correct? (Diagnostics attached.) Also, should I stop the parity check? Finally, if I replace this 4TB disk, can I replace it with an 8TB disk (which is the size of my parity disks)? For the replacement, do I just put the new one in place of the old one and rebuild?
tower-diagnostics-20200325-1526.zip
JorgeB Posted March 25, 2020
Disk3 dropped offline, most likely because of a power/connection problem, but since it dropped there's no SMART report. You should cancel the parity check, check/replace the cables and post new diags so we can check SMART.
sonofdbn Posted March 25, 2020
Thanks for the quick response. OK, I cancelled the parity check and checked the cables, which seemed to be OK. I restarted but the disk still has an "x" next to it. (Weirdly, I get a notification saying the array has turned good, array has 0 disks with read errors. I get that the array still works with the disk disabled, but it doesn't seem like the state of the array should be termed "good".) If I need to restart the parity check, should I be writing corrections to parity? Diagnostics are attached.
tower-diagnostics-20200325-1911.zip
JorgeB Posted March 25, 2020
8 minutes ago, sonofdbn said: I restarted but the disk still has an "x" next to it.
That's expected; once a disk is disabled it needs to be rebuilt. The disk looks fine, and the high number of CRC errors suggests a SATA cable problem. If you don't want to replace the cable, at least swap it with another disk's to rule it out in case the disk gets disabled again. Then, since the emulated disk is mounting correctly, you can rebuild on top.
JorgeB Posted March 25, 2020
Just now, johnnie.black said: at least swap with another disk
Just a warning that this can be a little risky for the rebuild, if there are errors on another disk and you let it finish.
sonofdbn Posted March 25, 2020
So should I do another parity check first? Then if that turns out OK, do a rebuild?
JorgeB Posted March 25, 2020
No, rebuild the disk, and I recommend replacing the SATA cable first.
sonofdbn Posted March 26, 2020
Thanks for all the help. All done and seems to be OK. The disk was connected via a forward breakout cable to a SAS HBA and I was a little worried, having no spare breakout cable. Fortunately I found the last remaining SATA port on the MB and managed to use that with a new SATA cable. I'll watch out for more CRC errors, though.
sonofdbn Posted March 31, 2020
Unfortunately the server crashed again. I noticed it when my Win10 VM disconnected (I wasn't using the VM, but it was running in an RDP window on my PC). I couldn't SSH in to the server and couldn't ping it. So I've rebooted and a parity check started automatically. I've attached the latest diagnostics (after rebooting) and the previous one for easy reference. From what I can tell, on Disk 3 the UDMA CRC Error Count hasn't changed. I'm wondering whether there's anything that can tell what caused the crash. So far, 4.8% into the parity check, there are no errors.
tower-diagnostics-20200331-1439.zip tower-diagnostics-20200325-1911.zip
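For anyone wanting to make the same before/after comparison of the CRC counter without reading the reports by eye, here is a minimal Python sketch. It assumes the SMART report inside each diagnostics zip is plain text in the usual `smartctl -A` attribute-table layout; the function names and the sample values are made up for illustration and are not taken from the attached diagnostics.

```python
from typing import Optional

def udma_crc_count(smart_text: str) -> Optional[int]:
    """Return the raw UDMA_CRC_Error_Count from smartctl-style text, or None."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            return int(fields[9])
    return None

def crc_delta(before: str, after: str) -> Optional[int]:
    """Difference in raw CRC count between two reports (None if either lacks it)."""
    old, new = udma_crc_count(before), udma_crc_count(after)
    if old is None or new is None:
        return None
    return new - old

# Illustrative sample reports (invented values, NOT from this thread).
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22000
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""
AFTER = BEFORE.replace("22000", "22140")  # later report, CRC count unchanged

delta = crc_delta(BEFORE, AFTER)
```

A delta of 0 means no new CRC errors were logged between the two reports. The raw counter never resets, so only an increase points to a fresh cable or connection problem.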
JorgeB Posted March 31, 2020
Try this, it might catch something if it happens again.
sonofdbn Posted March 31, 2020
Fortunately I ran the syslog server tool as suggested - the server crashed again. This time I could ping, but no SSH. Also I couldn't see any of the shares, but my Win10 VM was still running. Weird? The GUI timed out with a 500 Internal Server Error. So I shut down the VM and then rebooted. I've attached the syslog, with the entries at the end that came after the reboot removed. My UPS is down, so those messages are not surprising. 192.168.134 is an Asus router. I have two Asus routers: one is the main one and the other is configured as an access point (and also serves as a switch). I believe that internal IP address is the access point. The single Ethernet cable from the server is connected to the access point. unRAID came up again at 00:45.
extract-syslog-192.168.1.14.log tower-diagnostics-20200401-0104.zip
JorgeB Posted March 31, 2020
There's a call trace in the syslog but I can't say what it is about. I would suggest running the server for a while in safe mode without docker/VMs. If it still crashes like that, it's likely a hardware problem; if it doesn't, start using plugins and other services a few at a time to see if you can find the culprit.
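To locate a call trace in a saved syslog quickly, something like the following Python sketch can help. It assumes the log is plain text and that the kernel marks traces with the usual "Call Trace:" line; the function name and the sample log lines are invented for illustration and are not from the attached syslog.

```python
from typing import List

def find_call_traces(log_text: str, before: int = 2, after: int = 5) -> List[List[str]]:
    """Return one snippet per "Call Trace:" line, with surrounding context lines."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "Call Trace:" in line:
            start = max(0, i - before)
            # Slicing past the end of the list is safe in Python.
            hits.append(lines[start:i + after + 1])
    return hits

# Invented sample log (NOT from this thread), just to show the shape of the output.
SAMPLE = """\
Mar 31 22:10:01 Tower kernel: example message before the fault
Mar 31 22:10:02 Tower kernel: Call Trace:
Mar 31 22:10:02 Tower kernel:  example_function+0x10/0x20
Mar 31 22:15:00 Tower sshd[123]: unrelated line
"""

snippets = find_call_traces(SAMPLE, before=1, after=1)
```

Each snippet includes the lines just before the trace, which often name the process or driver involved. That can be a useful starting point when re-enabling plugins and services a few at a time.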
sonofdbn Posted March 31, 2020
Should I let the parity check finish? (The crash came roughly 50% of the way into the previous parity check.) Is there a possibility that the flash drive is corrupted? I did install a docker recently, about the time of the first crash. I haven't started it this time round. Don't want to name it and give it a bad rep when it might have nothing to do with the crashes.
JorgeB Posted March 31, 2020
17 minutes ago, sonofdbn said: Is there a possibility that the flash drive is corrupted?
Unlikely, let the check finish.