November 21, 2019
Hi all,

Out of the blue I got a notification that something was wrong with my cache (stupidly, I forgot to take a screenshot). I looked at the shares and there were no failures, but all my dockers had stopped! I saw on Main that the log was at 100%. A forum post said to just reboot the system. I did a reboot, but my NAS did not power off, so I went in over SSH with PuTTY and got it turned off.

I rebooted and tried to start my array, but it got stuck and nothing helped. Again I pressed shut down and nothing happened (well, Unraid stopped, but not my NAS). I rebooted it again via PuTTY and tried starting in maintenance mode, which worked. Now I am running a parity check, but it takes 8 hours.

Maybe somebody can already help me with my diagnostics.zip, which I was luckily able to generate during the parity check. For some reason there were more of them on the flash drive from within this hour of problems!! Can somebody check what is going on? The problem started with a notification about my cache SSD (which is hardly a month old).

Thanks!!

tower-diagnostics-20191121-1510.zip
tower-diagnostics-20191121-1525.zip
tower-diagnostics-20191121-1539.zip
tower-diagnostics-20191121-1553.zip
November 21, 2019 Community Expert
NVMe device dropped offline:

Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting
Nov 21 15:03:30 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:40 Tower kernel: nvme nvme0: I/O 821 QID 2 timeout, aborting
Nov 21 15:03:40 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:47 Tower kernel: nvme nvme0: I/O 822 QID 2 timeout, aborting
Nov 21 15:03:47 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller
Nov 21 15:04:53 Tower kernel: nvme nvme0: I/O 10 QID 0 timeout, reset controller
Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset
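For anyone who wants to scan their own syslog for the same pattern, here is a minimal sketch that counts the NVMe timeout and reset messages per device. The function name, the category labels and the regular expression are my own assumptions for illustration, not anything Unraid ships; it only recognises the three message shapes quoted above.

```python
import re
from collections import Counter

# Matches kernel messages shaped like the ones quoted above, e.g.
#   "Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting"
NVME_LINE = re.compile(r"kernel: nvme (?P<dev>nvme\d+): (?P<msg>.*)$")


def summarize_nvme_events(lines):
    """Count I/O timeouts, controller resets and failed resets per NVMe device.

    The category names are hypothetical labels chosen for this sketch.
    """
    summary = {}
    for line in lines:
        match = NVME_LINE.search(line)
        if not match:
            continue  # not an nvme kernel message
        counts = summary.setdefault(match.group("dev"), Counter())
        msg = match.group("msg")
        if "timeout, aborting" in msg:
            counts["io_timeout"] += 1
        elif "timeout, reset controller" in msg:
            counts["controller_reset"] += 1
        elif "Device not ready" in msg:
            counts["reset_failed"] += 1
    return summary


# The syslog excerpt from the post above.
SAMPLE_LOG = [
    "Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting",
    "Nov 21 15:03:30 Tower kernel: nvme nvme0: Abort status: 0x0",
    "Nov 21 15:03:40 Tower kernel: nvme nvme0: I/O 821 QID 2 timeout, aborting",
    "Nov 21 15:03:40 Tower kernel: nvme nvme0: Abort status: 0x0",
    "Nov 21 15:03:47 Tower kernel: nvme nvme0: I/O 822 QID 2 timeout, aborting",
    "Nov 21 15:03:47 Tower kernel: nvme nvme0: Abort status: 0x0",
    "Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller",
    "Nov 21 15:04:53 Tower kernel: nvme nvme0: I/O 10 QID 0 timeout, reset controller",
    "Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset",
]

if __name__ == "__main__":
    for dev, counts in summarize_nvme_events(SAMPLE_LOG).items():
        print(dev, dict(counts))
```

A "reset_failed" entry is the interesting one: in the excerpt above it corresponds to the "Device not ready; aborting reset" line, after which the device is gone until the next reboot or reseat.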
November 21, 2019 Author
Thanks, so the notifications were right. BUT what does this mean: is it dead? Or is it because the logs were full? The only other thing I can think of is that I may have been downloading a file bigger than my cache or something. What should I do now?
November 21, 2019 Community Expert
It was a hardware problem. See if a reboot brings it back online. I only checked the first diags.
November 21, 2019 Author
Aah OK! Like I mentioned, I rebooted the system via SSH, because the normal reboot did not shut the system down; only the "website" became unreachable. However, after the reboot it still complained about an unclean shutdown...
November 21, 2019 Community Expert
mvanhooff said: "after reboot it still complained about not clean shutdown..."
That's normal, after a few seconds hard shutdown is done.
November 21, 2019 Author
I can imagine that is normal, since the system was not completely shut down (for some reason it was not able to do so)! But for some reason, starting the array will not complete! Some extra information:
- plugins loaded
- shares loaded
- dockers kept loading but showed nothing. A few hours ago (I guess) I updated some dockers.
Should I terminate the parity check? Or should I uncheck the SSD and try to restart the array?
November 21, 2019 Author
Did another diagnostics and the end says something about dockerd:

Nov 21 17:30:52 Tower dhcpcd[1675]: br0: failed to renew DHCP, rebinding
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface br-8c356d0b587c.IPv4 with address 172.18.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface br-8c356d0b587c.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.18.0.1 on br-8c356d0b587c.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): br-8c356d0b587c: link is not ready
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface docker0.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.17.0.1 on docker0.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready

Something to do with the dockers not being able to connect or something?
November 21, 2019 Author
OK, the strangest thing happened... I ended the parity check because I wanted to do a clean reboot (despite the 2 times it did not work properly). Eventually, after 10-15 minutes, the array started. All my dockers are back, but not (yet) started. It started a parity check by itself. I think it is better to let it run now?

But still, I don't know where the problems came from (most recent syslog attached)! The only thing I can think of is that the "usable size; log" was completely full and my docker image is quite big (21 GB). Moreover, I think it would be better to let deluge and sabnzbd put incomplete downloads straight onto one of my array drives instead of onto my cache drive first?

tower-diagnostics-20191121-1659.zip
November 21, 2019 Community Expert
The initial problem was caused by the cache device dropping offline. This can cause filesystem issues with both the cache filesystem and the docker image, but everything looks normal so far in the last diags.
November 21, 2019 Author
There is the error again:

Unraid Cache disk message: 21-11-2019 18:21
Warning [TOWER] - Cache pool BTRFS missing device(s)
ADATA_SX6000LNP_2J2820125962 (nvme0n1)

Is this the part where the cache drive is dead!? Or loose? ;-)! If so, how can I get it up and running again without losing all my settings?

tower-diagnostics-20191121-1725.zip

Edited November 21, 2019 by mvanhooff: Diagnostics, offline again
November 21, 2019 Community Expert
Yes, it dropped offline again. This will cause havoc with anything using it, like dockers.
November 21, 2019 Author
I will do a forced shutdown and see if the cache drive is loose. Otherwise I'll buy a new one, I guess? I have a CA-Backup, so I guess it should be "easy" to get it up and running again?

Edited November 21, 2019 by mvanhooff
November 21, 2019 Author
Did a normal shutdown (worked!). Took out the PCIe M.2 SSD and put it back in. Restarted the NAS and started the array; again it took pretty long! After a while the shares came up and eventually the dockers, but I get an error starting them: [screenshot]

Then I tried installing a random docker: [screenshot]

The hell?? Then I checked my dashboard: the drive is active, but shows a SMART failure and 7 unsafe shutdowns (probably from going offline and online), plus extra information: [screenshot]

And also the diagnostics.sys.

What are the best options for now!? A clean install of Unraid, but that won't fix the offline/online problem, I suppose. Could it be that it was taken offline by software?? I made a backup of the appdata, flash drive, VMs, etc. Can somebody advise whether I should buy a new SSD? It must be M.2 PCIe so that I don't lose a SATA port (that's what the motherboard manual says). The SSD goes into an M.2 slot.

tower-diagnostics-20191121-1857.zip

Edited November 21, 2019 by mvanhooff: double image
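To look at the same SMART data outside the dashboard, something like the sketch below could flag the two things mentioned in the post: the failed health status and the unsafe shutdown count. It assumes smartmontools 7.0 or newer, whose `smartctl -j -a /dev/nvme0` prints a JSON report; the key names used here (`smart_status.passed`, `nvme_smart_health_information_log`) are what I understand that JSON to contain, so check them against your own output. The helper name and the sample report are made up for illustration, using only the values quoted in the post.

```python
import json


def nvme_health_warnings(report):
    """Return a list of warning strings from a parsed smartctl JSON report.

    `report` is assumed to be the dict obtained from the output of
    `smartctl -j -a /dev/nvme0` (hypothetical device path).
    """
    warnings = []
    # Overall health verdict; False is what a dashboard shows as "SMART failure".
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART health check FAILED")
    log = report.get("nvme_smart_health_information_log", {})
    if log.get("critical_warning", 0) != 0:
        warnings.append(f"critical warning flags set: {log['critical_warning']:#04x}")
    if log.get("media_errors", 0) > 0:
        warnings.append(f"{log['media_errors']} media errors")
    if log.get("unsafe_shutdowns", 0) > 0:
        warnings.append(f"{log['unsafe_shutdowns']} unsafe shutdowns")
    return warnings


# Illustrative report only: it contains just the two facts from the post
# (SMART failed, 7 unsafe shutdowns), not real smartctl output.
SAMPLE_REPORT = json.loads("""
{
  "smart_status": {"passed": false},
  "nvme_smart_health_information_log": {"unsafe_shutdowns": 7}
}
""")

if __name__ == "__main__":
    for warning in nvme_health_warnings(SAMPLE_REPORT):
        print("WARNING:", warning)
```

Unsafe shutdowns on their own are not alarming, since every hard power-off adds one. The failed overall status is the part that matters.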
November 21, 2019 Community Expert
The cache filesystem or the docker image is likely corrupt. Also, a SMART fail is a big red flag; you should replace it.
November 21, 2019 Community Expert
Missed the diags. Yep, the cache filesystem is corrupt. IMHO there is not much point in trying to fix it; just replace the cache device.
November 21, 2019
It looks like a garbage ADATA drive. I wouldn’t be remotely surprised if it has failed, and I agree with johnnie: time to feed it to the woodchipper 😛
November 21, 2019 Author
I'll buy a new one; for now I have shut down my NAS. I'll contact the shop tomorrow, since the SSD is only a month old! I'll buy a Kingston 500 GB. https://www.alternate.nl/Kingston/A2000-500-GB-SSD/html/product/1568217?lk=15414 Something like this one?
This topic is now archived and is closed to further replies.