Array unable to shut down gracefully, read/write speeds ground to a halt, and VMs often freeze the system, requiring a forced shutdown



Hi all, 

 

I'm experiencing a lot of problems with my array. It had been running smoothly for the better part of 18 months, the last 12 of those on the existing hardware. 

 

I'll break down the issues and their symptoms. 

 

1. Slow read and write speeds

- Initially I noticed this when creating a secondary vdisk for a Windows 10 VM I was using. A 200G vdisk took 10 minutes to format and, once formatted, was completely unusable. Activity Monitor showed peak speeds of less than 1MB/s. 

- Parity checks start at a reasonable (albeit much slower than before) speed of 110MB/s before slowing to 0.5-2MB/s after hitting about 5%.

- A CLI copy (using the advanced move and copy plugin) showed an average speed of <1MB/s, with very brief and infrequent peaks of 200MB/s.

- I couldn't measure SMB copy speeds, as it took 10 minutes for the root folder to populate before my disk speed measurement tool (Blackmagic Disk Speed Test) crashed. (A dd test that sidesteps SMB entirely is sketched below.)
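
For anyone wanting to reproduce the numbers outside SMB, here is a rough sequential test from the console. This is only a sketch: the path is an example (substitute a disk with free space), and writes to an array disk carry parity overhead.

# Rough sequential write test straight to an array disk (path is an example).
# oflag=direct bypasses the page cache so the figure reflects the disk, not RAM.
dd if=/dev/zero of=/mnt/disk1/ddtest.bin bs=1M count=1024 oflag=direct status=progress

# Rough sequential read test of the same file, again bypassing the cache.
dd if=/mnt/disk1/ddtest.bin of=/dev/null bs=1M iflag=direct status=progress

# Clean up afterwards.
rm /mnt/disk1/ddtest.bin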

 

2. Given the above, the logical step was a graceful reboot. I've tried this from the webgui as well as over SSH ('sudo reboot -n'), but the array never shuts down, forcing a hard reboot using the power button. Not ideal.
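
If it happens again, these are the checks I'll try from a console before reaching for the power button; just a sketch using the standard Unraid mount points:

# List processes with files open under the user shares; anything shown
# here can prevent the array from unmounting.
lsof /mnt/user 2>/dev/null

# The same idea per mount point, with PIDs and access modes.
fuser -mv /mnt/user /mnt/disk* 2>/dev/null

# Watch the syslog live while clicking Stop in the webgui to see where
# the shutdown stalls.
tail -f /var/log/syslog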

 

3. I believe this is related to the disk speed issue, but my Windows 10 VM has been extremely unreliable. Initially I thought it was a corrupt install, so I did a fresh one, but I'm experiencing exactly the same issues. Programs are very slow to start and often fail to start at all. Occasionally starting a program freezes the VM, which in turn crashes Unraid: the web GUI fails to load, I'm unable to SSH into the array, and all my docker containers become unresponsive, though I'm still able to ping the NIC. 
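
Since /var/log lives in RAM and a forced reset wipes it, a sketch of how I plan to capture evidence the next time it freezes (the hostname below is a placeholder for my server's address):

# Mirroring the syslog to the flash drive (Settings -> Syslog Server in
# the webgui) lets the log survive a crash; the copy should land under
# /boot/logs on the flash device. Alternatively, stream the log to
# another machine on the LAN while reproducing the VM freeze:
ssh root@hades 'tail -f /var/log/syslog' | tee hades-syslog.txt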

 

 

I've attached a diagnostics file, although it was taken after a dirty reboot, so I'm not entirely sure what will have been recorded. 

 

Thank you in advance to anyone offering some support; this has caused a lot of lost sleep.

 

 

Edit: 

Managed a reboot to attempt a Memtest, but it never made it to the Memtest screen; it kept kicking me back out to the Unraid boot menu.

 

 

hades-diagnostics-20220401-0859.zip


Your syslog seems to be full of messages of the form:

Mar 31 21:50:03 Hades kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:0b:00.0
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:   device [c0a9:540a] error status/mask=00000001/00006000
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:    [ 0] RxErr    

Not sure of the cause, or whether they explain your issue, but it is possible they do.
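
To gauge the scale, and to try the mitigation that usually gets suggested for floods of corrected errors, a rough sketch; the kernel parameters below are standard Linux options, not a guaranteed fix:

# Count how heavily the corrected errors are flooding the syslog.
grep -c "PCIe Bus Error" /var/log/syslog

# Disabling PCIe Active State Power Management often stops these floods.
# On Unraid, add it to the append line in /boot/syslinux/syslinux.cfg
# (Main -> Flash in the webgui), e.g.:
#   append pcie_aspm=off initrd=/bzroot
# pci=noaer would silence AER reporting instead, but that only hides the
# messages rather than addressing them.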

 

BTW: your Parity Check Tuning plugin is not the latest version. You want the 2022-03-31 version.

28 minutes ago, itimpi said:

Your syslog seems to be full of messages of the form:

Mar 31 21:50:03 Hades kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:0b:00.0
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:   device [c0a9:540a] error status/mask=00000001/00006000
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:    [ 0] RxErr    

Not sure of the cause, or whether they explain your issue, but it is possible they do.

 

BTW: your Parity Check Tuning plugin is not the latest version. You want the 2022-03-31 version.

I hadn't noticed that, thank you.

Does that point towards an issue with the NVMe or the PCIe bus itself?

 

I've just updated the Parity Check Tuning plugin. 

 

Edit: I've run a SMART test on the drive and it isn't showing any errors. See attached.

Not sure how this explains the slowness in the array itself, as this is my cache drive, but it probably isn't helping.
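
For reference, the attached report came from smartctl on the console; only a sketch, and the device name is an example (check yours with ls /dev/nvme*):

# Full SMART output for the NVMe cache drive (device name is an example).
smartctl -a /dev/nvme0

# NVMe drives also keep an error-information log worth checking alongside
# the self-test results.
smartctl -l error /dev/nvme0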

hades-smart-20220401-1327.zip

54 minutes ago, JorgeB said:

It's usually an issue with the board/slot and device combo; it might work better in a different slot if available.

It's been in this slot and working for the better part of 12 months. Not to say that isn't the issue, but unfortunately my motherboard only has the one slot. I might see if there is another way I can test it.
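
One idea for exercising the drive without a second slot, sketched below; it's read-only, the device name is an example, and the PCI address comes from the syslog lines above:

# Stream a few GB off the raw device and watch whether throughput
# collapses the way the array does (run with the drive otherwise idle).
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct status=progress

# Corrected physical-layer errors often scale with link speed; the
# negotiated link state can hint at signalling trouble.
lspci -vv -s 0b:00.0 | grep -i lnksta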

 

