mvanhooff Posted November 21, 2019

Hi all,

Out of the blue I got a notification that something was wrong with my cache (stupidly, I forgot to take a screenshot). I looked at the shares and there were no failures, but all my dockers had stopped! On Main I saw that the log was at 100%. A forum post said to just reboot your system. I did a reboot, but my NAS did not power off. I logged in over SSH with PuTTY and got it turned off. I rebooted and tried to start my array, but it got stuck and nothing helped. Again I pressed shut down and nothing happened (well, Unraid stopped, but not my NAS). I rebooted it again via PuTTY and tried starting in maintenance mode, which worked. Now I am doing a parity check, but it takes 8 hours.

Maybe somebody can already help me with my diagnostics.zip, which I was (luckily) able to run during the parity check. For some reason there were more of them on the flash drive from this hour of problems! Can somebody check what is going on? The problem started with a notification about my cache SSD (which is hardly a month old).

Thanks!!

tower-diagnostics-20191121-1510.zip
tower-diagnostics-20191121-1525.zip
tower-diagnostics-20191121-1539.zip
tower-diagnostics-20191121-1553.zip

JorgeB Posted November 21, 2019

NVMe device dropped offline:

Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting
Nov 21 15:03:30 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:40 Tower kernel: nvme nvme0: I/O 821 QID 2 timeout, aborting
Nov 21 15:03:40 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:47 Tower kernel: nvme nvme0: I/O 822 QID 2 timeout, aborting
Nov 21 15:03:47 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller
Nov 21 15:04:53 Tower kernel: nvme nvme0: I/O 10 QID 0 timeout, reset controller
Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset

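For anyone wanting to spot this pattern in their own syslog, here is a minimal Python sketch that counts the NVMe timeout and "Device not ready" events per device. The regular expressions are written only against the lines quoted above, so treat them as an assumption about the message format rather than a complete list of what the kernel NVMe driver can log. The function name `summarize_nvme_errors` is made up for this example.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting
TIMEOUT_RE = re.compile(
    r"kernel: nvme (?P<dev>nvme\d+): I/O \d+ QID \d+ timeout, (?P<action>.+)$"
)
# Matches: ... kernel: nvme nvme0: Device not ready; aborting reset
NOT_READY_RE = re.compile(r"kernel: nvme (?P<dev>nvme\d+): Device not ready")


def summarize_nvme_errors(lines):
    """Count timeout actions and 'Device not ready' events per NVMe device."""
    summary = Counter()
    for line in lines:
        line = line.rstrip()
        match = TIMEOUT_RE.search(line)
        if match:
            summary[(match["dev"], match["action"])] += 1
            continue
        match = NOT_READY_RE.search(line)
        if match:
            summary[(match["dev"], "device not ready")] += 1
    return summary


# The syslog excerpt quoted in this thread.
SAMPLE = """\
Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting
Nov 21 15:03:30 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:40 Tower kernel: nvme nvme0: I/O 821 QID 2 timeout, aborting
Nov 21 15:03:40 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:47 Tower kernel: nvme nvme0: I/O 822 QID 2 timeout, aborting
Nov 21 15:03:47 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller
Nov 21 15:04:53 Tower kernel: nvme nvme0: I/O 10 QID 0 timeout, reset controller
Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset
"""

if __name__ == "__main__":
    for (dev, event), count in summarize_nvme_errors(SAMPLE.splitlines()).items():
        print(f"{dev}: {event} x{count}")
```

On a live server the same function could be fed the lines of the syslog file (assumed here to be /var/log/syslog) instead of the sample string. Timeouts followed by a controller reset and then "Device not ready" are the sequence that was read above as the device dropping offline.
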
mvanhooff Posted November 21, 2019 Author

Thanks, so the notifications were right. BUT what does this mean: is it dead? Or did it happen because the logs were full? The only other thing I can think of is that I may have been downloading a file bigger than my cache or something. What should I do now?

JorgeB Posted November 21, 2019

It was a hardware problem. See if a reboot brings it back online; I only checked the first diags.

mvanhooff Posted November 21, 2019 Author

Aah ok! Like I mentioned, I rebooted the system via SSH because a normal reboot did not shut the system down; only the "website" became unreachable. However, after the reboot it still complained about an unclean shutdown...

JorgeB Posted November 21, 2019

mvanhooff said: "after reboot it still complained about not clean shutdown..."

That's normal, after a few seconds a hard shutdown is done.

mvanhooff Posted November 21, 2019 Author

I can imagine that is normal, since the system was not completely shut down (for some reason it was not able to do so)! But for some reason, starting the array will not complete! Some extra information:

- plugins loaded
- shares loaded
- dockers kept loading but showed nothing

A few hours ago (I guess) I updated some dockers. Should I terminate the parity check? Or should I unassign the SSD and try to restart the array?

mvanhooff Posted November 21, 2019 Author

Did another diagnostics and the end says something about dockerd:

Nov 21 17:30:52 Tower dhcpcd[1675]: br0: failed to renew DHCP, rebinding
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface br-8c356d0b587c.IPv4 with address 172.18.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface br-8c356d0b587c.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.18.0.1 on br-8c356d0b587c.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): br-8c356d0b587c: link is not ready
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface docker0.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.17.0.1 on docker0.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready

Something to do with the dockers not being able to connect or something?

JorgeB Posted November 21, 2019

Try to get the complete diags, or at least the syslog.

mvanhooff Posted November 21, 2019 Author

Ok, the strangest thing happened... I ended the parity check because I wanted to do a clean reboot (despite the 2 times it did not work properly). Eventually, after 10-15 minutes, the array started. All my dockers are back, but not (yet) started. It started a parity check by itself. I think it is better to let it run now?

But I still don't know where the problems came from (most recent syslog attached)! The only thing I can think of is that the "usable size; log" was completely full and my docker image is quite big (21 GB). Moreover, I think it would be better to let Deluge and SABnzbd put incomplete downloads straight onto one of my array drives instead of onto my cache drive first?

tower-diagnostics-20191121-1659.zip

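On the "log at 100%" symptom: a rough way to see how full the log filesystem is, and which files are taking the space, is sketched below in Python. It assumes the log lives under /var/log (the default path used in the example) and that Python is available on the server; the function names `usage_percent` and `largest_files` are invented for this example. A log that fills up during an incident like this one is often a side effect of the same error being written over and over, not the cause.

```python
import os
import shutil


def usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def largest_files(root, count=5):
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:count]


if __name__ == "__main__":
    log_dir = "/var/log"  # assumed location of the log filesystem
    print(f"{log_dir} is {usage_percent(log_dir):.1f}% full")
    for size, path in largest_files(log_dir):
        print(f"{size:>12} bytes  {path}")
```
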
JorgeB Posted November 21, 2019

The initial problem was caused by the cache device dropping offline. This can cause issues with both the cache filesystem and the docker image, but everything looks normal so far in the last diags.

mvanhooff Posted November 21, 2019 Author

There is the error again:

Unraid Cache disk message: 21-11-2019 18:21
Warning [TOWER] - Cache pool BTRFS missing device(s)
ADATA_SX6000LNP_2J2820125962 (nvme0n1)

Is this the part where the cache drive is dead!? Or loose? ;-)! If so, how can I get it up and running again without losing all my settings?

tower-diagnostics-20191121-1725.zip

JorgeB Posted November 21, 2019

Yes, it dropped offline again, this will cause havoc with anything using it, like dockers.

mvanhooff Posted November 21, 2019 Author

I will do a forced shutdown and see if the cache drive is loose... otherwise buy a new one, I guess? I have a CA Backup, so I guess it should be "easy" to get it up and running again?

mvanhooff Posted November 21, 2019 Author

Did a normal shutdown (worked!). Took out the PCIe M.2 SSD and put it back in. Restarted the NAS and started the array; again it took pretty long! After a while the shares came up and eventually the dockers, but I get an error starting them.

Then I tried installing a random docker. The hell??

Then I checked my dashboard: active, but SMART failure. 7 unsafe shutdowns (probably from going offline and online). And also the diagnostics.

What are the best options for now!? A clean install of Unraid, but that won't fix the offline/online problem, I suppose... Could it be that it was taken offline by software??

I made a backup of the appdata, flash drive, VMs, etc. Can somebody advise whether I should buy a new SSD? It must be M.2 PCIe so I don't lose a SATA port (that's what the motherboard manual says). The SSD goes into an M.2 slot.

tower-diagnostics-20191121-1857.zip

JorgeB Posted November 21, 2019

The cache filesystem or the docker image is likely corrupt. Also, a SMART fail is a big red flag; you should replace the device.

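To check the SMART state from the command line instead of the dashboard, the text report from smartmontools (for example `smartctl -a /dev/nvme0`, with the device path being an assumption for this system) can be scanned for a few fields. The Python sketch below parses such a report. The field names follow the usual NVMe text output of smartctl as I understand it, and the sample string is illustrative only: it is built from the two facts reported in this thread (SMART failure, 7 unsafe shutdowns), not copied from the poster's diagnostics. The function name `parse_nvme_smart` is made up for this example.

```python
import re

HEALTH_RE = re.compile(
    r"^SMART overall-health self-assessment test result:\s*(\w+)", re.MULTILINE
)
FIELD_RE = re.compile(
    r"^(Critical Warning|Unsafe Shutdowns|Media and Data Integrity Errors):\s*(\S+)",
    re.MULTILINE,
)


def parse_nvme_smart(text):
    """Extract the overall health verdict and a few counters from smartctl text."""
    report = {}
    match = HEALTH_RE.search(text)
    if match:
        report["health"] = match.group(1)
    for name, value in FIELD_RE.findall(text):
        # int(value, 0) accepts both decimal ("7") and hex ("0x00") notation;
        # thousands separators such as "1,234" are stripped first.
        report[name] = int(value.replace(",", ""), 0)
    return report


# Illustrative sample only, NOT taken from the poster's diagnostics.
SAMPLE = """\
SMART overall-health self-assessment test result: FAILED!
Unsafe Shutdowns:                   7
"""

if __name__ == "__main__":
    print(parse_nvme_smart(SAMPLE))
```

A health verdict other than PASSED, or a non-zero critical warning, points the same way as the dashboard's SMART failure: replace the device.
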
JorgeB Posted November 21, 2019

Missed the diags, yep, cache filesystem is corrupt, IMHO not much point in trying to fix, just replace the cache device.

danull Posted November 21, 2019

It looks like a garbage ADATA drive. I wouldn’t be remotely surprised if it has failed, and I agree with johnnie: time to feed it to the woodchipper 😛

mvanhooff Posted November 21, 2019 Author

I'll buy a new one; for now I have shut down my NAS. I'll contact the shop tomorrow, since the SSD is only a month old! I'll buy a Kingston 500 GB.

https://www.alternate.nl/Kingston/A2000-500-GB-SSD/html/product/1568217?lk=15414

Something like this one?