Starting array stuck

mvanhooff · November 21, 2019

Hi all,

Out of the blue I got a notification that there was something wrong with my cache (stupid that I forgot to take a screenshot).

I looked at shares and there were no failures. All my dockers stopped! I saw on Main that logs were 100%.

A forum post said to just reboot your system. I did a reboot but my NAS did not power off. I dove into my PuTTY SSH session and got it turned off.

I rebooted and tried to start my array but it got stuck, and nothing helped. Again I pressed shut down, and nothing happened (well, Unraid stopped, but not my NAS).

I rebooted it again via PuTTY and tried starting in maintenance mode, which worked. Now I am running a parity check, but it takes 8 hours.

Maybe somebody can already help me with my diagnostics.zip which I was able to run during the parity check (luckily).

For some reason there were more diagnostics files on the flash drive from this hour of problems!!

Can somebody check what is going on? The problem started with a notification of my cache SSD (which is hardly a month old).

Thanks!!

tower-diagnostics-20191121-1510.zip tower-diagnostics-20191121-1525.zip tower-diagnostics-20191121-1539.zip tower-diagnostics-20191121-1553.zip

JorgeB · November 21, 2019

NVMe device dropped offline:

Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting
Nov 21 15:03:30 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:40 Tower kernel: nvme nvme0: I/O 821 QID 2 timeout, aborting
Nov 21 15:03:40 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:03:47 Tower kernel: nvme nvme0: I/O 822 QID 2 timeout, aborting
Nov 21 15:03:47 Tower kernel: nvme nvme0: Abort status: 0x0
Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller
Nov 21 15:04:53 Tower kernel: nvme nvme0: I/O 10 QID 0 timeout, reset controller
Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset
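
For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch that counts the kinds of NVMe messages shown above. The function name `summarize_nvme_events` and the category labels are made up for illustration, and the matching only assumes the message wording seen in this excerpt; other kernel versions may phrase these lines differently.

```python
import re
from collections import Counter

# Matches kernel NVMe messages in the format seen in the excerpt above, e.g.
# "Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting"
NVME_LINE = re.compile(r"kernel: nvme (?P<dev>nvme\d+): (?P<msg>.*)$")


def summarize_nvme_events(lines):
    """Count timeout, controller-reset, and not-ready messages per controller.

    The category names are illustrative labels, not kernel terminology.
    """
    counts = Counter()
    for line in lines:
        match = NVME_LINE.search(line)
        if not match:
            continue  # not an NVMe kernel message
        dev, msg = match.group("dev"), match.group("msg")
        if "timeout, aborting" in msg:
            counts[(dev, "timeout_abort")] += 1
        elif "timeout, reset controller" in msg:
            counts[(dev, "controller_reset")] += 1
        elif "Device not ready" in msg:
            counts[(dev, "not_ready")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 21 15:03:30 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, aborting",
        "Nov 21 15:04:00 Tower kernel: nvme nvme0: I/O 399 QID 4 timeout, reset controller",
        "Nov 21 15:05:22 Tower kernel: nvme nvme0: Device not ready; aborting reset",
    ]
    for (dev, kind), n in sorted(summarize_nvme_events(sample).items()):
        print(dev, kind, n)
```

To scan a whole log, pass it an open file, for example `summarize_nvme_events(open("syslog.txt"))`, where `syslog.txt` stands in for the syslog extracted from a diagnostics zip. Repeated timeouts followed by a failed controller reset, as in the lines above, are what "dropped offline" looks like in the log.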

mvanhooff · November 21, 2019

Thanks, so the notifications were right. But what does this mean? Is it dead? Or is it because the logs were full? The only other thing is that I may have been downloading a file bigger than my cache drive or something. What should I do now?

JorgeB · November 21, 2019

It was a hardware problem; see if a reboot brings it back online. I only checked the first diags.

mvanhooff · November 21, 2019

Aah ok! Like I mentioned, I rebooted the system via SSH because the normal reboot did not shut down the system; only the "website" was not reachable. However, after the reboot it still complained about an unclean shutdown...

JorgeB · November 21, 2019

Just now, mvanhooff said:

after the reboot it still complained about an unclean shutdown...

That's normal; after a few seconds a hard shutdown is done.

mvanhooff · November 21, 2019

I can imagine that is normal, since the system was not completely shut down (but for some reason, the system was not able to do so)!

But for some reason, starting the array will not complete! Some extra information:

- plugins loaded

- shares loaded

- dockers kept loading but showed nothing.

A few hours ago (I guess) I updated some dockers.

Should I terminate the parity check?

Or should I unassign the SSD and try to restart the array?

mvanhooff · November 21, 2019

Ran another diagnostics and the end of the syslog says something about dockerd:

Nov 21 17:30:52 Tower dhcpcd[1675]: br0: failed to renew DHCP, rebinding
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface br-8c356d0b587c.IPv4 with address 172.18.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface br-8c356d0b587c.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.18.0.1 on br-8c356d0b587c.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): br-8c356d0b587c: link is not ready
Nov 21 17:34:23 Tower avahi-daemon[6330]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.0.1.
Nov 21 17:34:23 Tower avahi-daemon[6330]: New relevant interface docker0.IPv4 for mDNS.
Nov 21 17:34:23 Tower avahi-daemon[6330]: Registering new address record for 172.17.0.1 on docker0.IPv4.
Nov 21 17:34:23 Tower kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready

Something to do with the dockers not being able to connect or something?

JorgeB · November 21, 2019

Try to get the complete diags, or at least the syslog.

mvanhooff · November 21, 2019

Ok, the strangest thing happened... I ended the parity check because I wanted to do a clean reboot (despite the 2 times it did not work properly).

Eventually after 10-15 minutes, the array started. All my dockers are back, but (not yet) started.

It started a parity check itself. I think it is better to let it run now?

But still, I don't know where the problems came from! (most recent syslog).

The only thing I can think of is that the "usable size; log" was completely full and my docker image is quite big (21 GB).

Moreover, I think it is better to let deluge and sabnzbd put incomplete downloads on one of my drives straight away instead of first on my cache drive?

tower-diagnostics-20191121-1659.zip

JorgeB · November 21, 2019

Initial problem was caused by the cache device dropping offline, this can cause filesystem issues with both the cache file system and docker image, but everything looks normal so far on the last diags.

mvanhooff · November 21, 2019

There is the error again:

Unraid Cache disk message: 21-11-2019 18:21

Warning [TOWER] - Cache pool BTRFS missing device(s)
ADATA_SX6000LNP_2J2820125962 (nvme0n1)

Is this the part where the cache drive is dead!? Or loose? ;-)!

If so, how can I get it up and running again without losing all settings?

tower-diagnostics-20191121-1725.zip

Edited November 21, 2019 by mvanhooff
Diagnostics, offline again

JorgeB · November 21, 2019

Yes, it dropped offline again, this will cause havoc with anything using it, like dockers.

mvanhooff · November 21, 2019

I will do a forced shutdown and see if the cache drive is loose. Otherwise buy a new one, I guess? I have a CA-Backup, so I guess it should be "easy" to get it up and running again?

Edited November 21, 2019 by mvanhooff
.

mvanhooff · November 21, 2019

Did a normal shutdown (worked!). Took out the PCIe M.2 SSD and put it back in. Restarted the NAS and started the array; again it took pretty long!

After a while the shares came up and eventually the dockers, but I get an error starting them:

image.png.7474b55b7ffe4976a55e49a6aa306c9c.png

Then I tried installing a random docker:

The hell??

Then I checked my dashboard:

image.png.304b9b33dfae831a7ca1031d8bd8db05.png

Active, but SMART failure

image.png.12c32a7af9dad604a61b141496ead34f.png

7 unsafe shutdowns (probably the offline online)

and extra information:

image.png.900eb5ca05eba086ae8b52df35ae83db.png

And also the diagnostics.sys

What are the best options for now!?

Clean install Unraid, but that won't fix the offline/online problem I suppose. Could it be that it was taken offline by software??

I made a backup of the appdata, flash drive, VM etc.

Can somebody give advice on whether to buy a new SSD? It must be M.2 PCIe so I don't lose a SATA port (that's what the motherboard manual says). The SSD goes into an M.2 slot.

tower-diagnostics-20191121-1857.zip

Edited November 21, 2019 by mvanhooff
double image

JorgeB · November 21, 2019

Cache filesystem or the docker image are likely corrupt, also SMART fail is a big red flag, you should replace it.

JorgeB · November 21, 2019

Missed the diags, yep, cache filesystem is corrupt, IMHO not much point in trying to fix, just replace the cache device.

danull · November 21, 2019

It looks like a garbage ADATA drive, I wouldn’t be remotely surprised it has failed & agree with johnnie, time to feed it to the woodchipper 😛

mvanhooff · November 21, 2019

I'll buy a new one; for now I shut down my NAS. I'll contact the shop tomorrow since the SSD is only a month old! I'll buy a Kingston 500 GB.

https://www.alternate.nl/Kingston/A2000-500-GB-SSD/html/product/1568217?lk=15414

Something like this one?
