Array unable to shut down gracefully, read/write speeds ground to a halt, and VMs often freeze the system, requiring a forced shutdown



Hi all, 

 

I'm experiencing a lot of problems with my array. It had been running smoothly for the better part of 18 months, the last 12 of those on the existing hardware. 

 

I'll break down the issues and their symptoms. 

 

1. Slow read and write speeds

- Initially I noticed this when creating a secondary vdisk for a Windows 10 VM I was using. A 200G vdisk took 10 minutes to format and, once formatted, was completely unusable. Activity Monitor showed peak speeds of less than 1MB/s. 

- Parity checks start at a reasonable (albeit much slower than before) speed of 110MB/s before slowing to 0.5-2MB/s after hitting about 5%.

- A CLI copy (using the advanced move and copy plugin) showed an average speed of <1MB/s, with very brief and infrequent peaks of 200MB/s.

- I couldn't measure SMB copy speeds, as it took 10 minutes for the root folder to populate before my disk speed measurement tool (Blackmagic Disk Speed Test) crashed. (A dd test that sidesteps SMB entirely is sketched below.)
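
For anyone wanting to reproduce the numbers outside SMB, here is a rough sequential test from the console. This is only a sketch: the path is an example (substitute a disk with free space), and writes to an array disk carry parity overhead.

# Rough sequential write test straight to an array disk (path is an example).
# oflag=direct bypasses the page cache so the figure reflects the disk, not RAM.
dd if=/dev/zero of=/mnt/disk1/ddtest.bin bs=1M count=1024 oflag=direct status=progress

# Rough sequential read test of the same file, again bypassing the cache.
dd if=/mnt/disk1/ddtest.bin of=/dev/null bs=1M iflag=direct status=progress

# Clean up afterwards.
rm /mnt/disk1/ddtest.bin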

 

2. Given the above, the logical step was a graceful reboot. I've tried this from the webgui as well as over SSH ('sudo reboot -n'), but the array never shuts down, forcing a hard reboot using the power button. Not ideal.
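
If it happens again, these are the checks I'll try from a console before reaching for the power button; just a sketch using the standard Unraid mount points:

# List processes with files open under the user shares; anything shown
# here can prevent the array from unmounting.
lsof /mnt/user 2>/dev/null

# The same idea per mount point, with PIDs and access modes.
fuser -mv /mnt/user /mnt/disk* 2>/dev/null

# Watch the syslog live while clicking Stop in the webgui to see where
# the shutdown stalls.
tail -f /var/log/syslog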

 

3. I believe this is related to the disk speed issue, but my Windows 10 VM has been extremely unreliable. Initially I thought it was a corrupt install, so I did a fresh one, but I'm experiencing exactly the same issues. Programs are very slow to start and often fail to start at all. Occasionally starting a program freezes the VM, which in turn crashes Unraid: the web GUI fails to load, I'm unable to SSH into the array, and all my docker containers become unresponsive, though I'm still able to ping the NIC. 
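
Since /var/log lives in RAM and a forced reset wipes it, a sketch of how I plan to capture evidence the next time it freezes (the hostname below is a placeholder for my server's address):

# Mirroring the syslog to the flash drive (Settings -> Syslog Server in
# the webgui) lets the log survive a crash; the copy should land under
# /boot/logs on the flash device. Alternatively, stream the log to
# another machine on the LAN while reproducing the VM freeze:
ssh root@hades 'tail -f /var/log/syslog' | tee hades-syslog.txt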

 

 

I've attached a diagnostics file, although it was taken after a dirty reboot, so I'm not entirely sure what will have been recorded. 

 

Thank you in advance to anyone offering some support; this has caused a lot of lost sleep.

 

 

Edit: 

Managed a reboot to attempt a Memtest, but it never made it to the Memtest screen; it kept kicking me back out to the Unraid boot menu.

 

 

hades-diagnostics-20220401-0859.zip


Your syslog seems to be full of messages of the form:

Mar 31 21:50:03 Hades kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:0b:00.0
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:   device [c0a9:540a] error status/mask=00000001/00006000
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:    [ 0] RxErr    

Not sure of the cause, or whether they explain your issue, but it is possible they do.
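
To gauge the scale, and to try the mitigation that usually gets suggested for floods of corrected errors, a rough sketch; the kernel parameters below are standard Linux options, not a guaranteed fix:

# Count how heavily the corrected errors are flooding the syslog.
grep -c "PCIe Bus Error" /var/log/syslog

# Disabling PCIe Active State Power Management often stops these floods.
# On Unraid, add it to the append line in /boot/syslinux/syslinux.cfg
# (Main -> Flash in the webgui), e.g.:
#   append pcie_aspm=off initrd=/bzroot
# pci=noaer would silence AER reporting instead, but that only hides the
# messages rather than addressing them.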

 

BTW: your Parity Check Tuning plugin is not the latest version. You want the 2022-03-31 version.

28 minutes ago, itimpi said:

Your syslog seems to be full of messages of the form:

Mar 31 21:50:03 Hades kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:0b:00.0
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:   device [c0a9:540a] error status/mask=00000001/00006000
Mar 31 21:50:03 Hades kernel: nvme 0000:0b:00.0:    [ 0] RxErr    

Not sure of the cause, or whether they explain your issue, but it is possible they do.

 

BTW: your Parity Check Tuning plugin is not the latest version. You want the 2022-03-31 version.

I hadn't noticed that, thank you.

Does that point towards an issue with the NVMe or the PCIe bus itself?

 

I've just updated the Parity Check Tuning plugin. 

 

Edit: I've run a SMART test on the drive and it isn't showing any errors. See attached.

Not sure how this explains the slowness in the array itself, as this is my cache drive, but it probably isn't helping.
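
For reference, the attached report came from smartctl on the console; only a sketch, and the device name is an example (check yours with ls /dev/nvme*):

# Full SMART output for the NVMe cache drive (device name is an example).
smartctl -a /dev/nvme0

# NVMe drives also keep an error-information log worth checking alongside
# the self-test results.
smartctl -l error /dev/nvme0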

hades-smart-20220401-1327.zip

54 minutes ago, JorgeB said:

It's usually an issue with the board/slot and device combo; it might work better in a different slot if available.

It's been in this slot and working for the better part of 12 months. Not to say that isn't the issue, but unfortunately my motherboard only has the one slot. I might see if there is another way I can test it.
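
One idea for exercising the drive without a second slot, sketched below; it's read-only, the device name is an example, and the PCI address comes from the syslog lines above:

# Stream a few GB off the raw device and watch whether throughput
# collapses the way the array does (run with the drive otherwise idle).
dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct status=progress

# Corrected physical-layer errors often scale with link speed; the
# negotiated link state can hint at signalling trouble.
lspci -vv -s 0b:00.0 | grep -i lnksta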

 

