cyberspectre Posted May 27, 2020

I've suddenly started having problems whenever I try to launch a VM. My server was rock solid for well over a year until now. When I launch a VM, shortly after the guest OS boots, the web UI shows 6 or 7 CPU threads (random ones, not necessarily the ones pinned to the VM) spiking and staying at redline. Seconds later, the web UI becomes unresponsive, the entire system locks up, and I have no choice but to power-cycle.

I've attached my diagnostics file and a couple of persistent log files that may be helpful. I have noticed these worrying messages:

May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 10 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 192 QID 10 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 193 QID 10 timeout, aborting
May 26 16:05:11 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, reset controller
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Device not ready; aborting reset
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 558795512
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 588290920
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 586574048
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 573818160
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7

Do these messages indicate my cache SSD is failing? The VM domains are on the cache, set to "prefer." Thanks for the assistance!

Attachments: andras4-diagnostics-20200526-2048.zip, syslog-1589603478, syslog-1590534581, syslog-1590535014, syslog-1590541990, syslog-1590542150, syslog-1590546809
JorgeB Posted May 27, 2020

This can sometimes help with those NVMe errors: some NVMe devices have issues with power states on Linux. On the main GUI page, click on Flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option after "append":

nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference.
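For reference, once that's added, the default boot stanza on the flash drive (normally /boot/syslinux/syslinux.cfg) ends up looking roughly like the sketch below. The exact label and initrd lines depend on your Unraid version, so treat it as illustrative rather than something to copy verbatim:

label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot

Setting the maximum APST latency to 0 keeps the drive out of its deeper power-saving states, which is what seems to trip up some controllers. After the reboot you can confirm the parameter actually took effect from a terminal:

# the flag should show up in the running kernel's command line
cat /proc/cmdline

# and the live module parameter should read 0
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us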
cyberspectre (Author) Posted May 28, 2020

19 hours ago, johnnie.black said: "Some NVMe devices have issues with power states on Linux ... add nvme_core.default_ps_max_latency_us=0 ... Reboot and see if it makes a difference."

Thanks, I've done so. It seems to be working normally now. I'll report back if I have more issues.
cyberspectre (Author) Posted May 28, 2020

Nope, it's still happening. Even when things seem stable with a VM running, all I need to do is start a few Docker containers and the system fails. Basically, heavier I/O triggers it.
JorgeB Posted May 28, 2020

Look for a BIOS update, or try a different NVMe device.
cyberspectre (Author) Posted June 5, 2020

On 5/28/2020 at 1:14 AM, johnnie.black said: "Look for a BIOS update, or try a different NVMe device."

Thanks, johnnie.black. Updating the firmware seemed promising at first, but ultimately made no difference. Since it's clear to me now that this is an issue with the disk itself, I'm going to start a new thread in the hardware forum to get some more opinions.
cyberspectre (Author) Posted June 14, 2020

For posterity, I'd like to report that I solved the problem. Using smartctl / nvme-cli, I discovered that even though the disk's main temperature was within the acceptable range, one of its secondary temperature sensors (labeled Temperature Sensor 5) was reading 60-64 °C at idle and 70 °C or higher under load. This is most likely the temperature of the controller. Apparently, most NVMe controllers begin throttling at around 70 °C, so the I/O errors make sense.

I moved the disk from the M.2 slot to a PCIe adapter and installed a metal heatsink. The controller now runs 20 degrees cooler on average, and I haven't experienced any more issues. When it's right behind a graphics card, an M.2 slot is apparently not an ideal thermal environment for an SSD.
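In case it helps anyone hitting the same thing, the per-sensor readings can be pulled from the command line with something like the two commands below. /dev/nvme0 is just my device path, so adjust it for your system, and not every drive exposes the extra sensors:

# full SMART report, including the individual temperature sensors where supported
smartctl -a /dev/nvme0

# or the same health data via nvme-cli
nvme smart-log /dev/nvme0

The composite "Temperature" value is what the GUI normally shows; the "Temperature Sensor N" lines are the extra sensors, and the hot one in my case was most likely the controller.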
JorgeB Posted June 14, 2020

Thanks for reporting back. If you don't mind, I'm going to tag this as solved.
cyberspectre (Author) Posted October 2, 2020

Unfortunately, the issue has resurfaced. It is certainly connected to the temperature of the controller. Moving the SSD away from the graphics cards and installing a heatsink did help, but ultimately did not eliminate the problem. Certain activities can cause the temperature of the controller to spike suddenly, and when it does, it's lights out.

To anyone who reads this in the future: do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat.
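If you want to catch one of these spikes as it happens, a crude logging loop is enough. This is just a sketch: it assumes nvme-cli is available and that the cache drive is nvme0, so adjust both for your system:

# log a timestamped temperature snapshot every 10 seconds
while true; do
  date '+%F %T' >> /tmp/nvme_temps.log
  nvme smart-log /dev/nvme0 | grep -i temperature >> /tmp/nvme_temps.log
  sleep 10
done

Start it before firing up the VM and containers (nohup or a screen session works), and the last entries written before a lockup will show how hot the controller sensor got.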
debit lagos Posted October 16, 2020

On 10/2/2020 at 1:13 AM, cyberspectre said: "To anyone who reads this in the future: do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat."

I honestly wish I had found this about three days ago, when I moved 900 GB worth of content to my P5 drive. I will be replacing it with a 970 Plus. Thank you for posting this.
R3nFoly Posted November 28, 2020

I'm facing the same issue with a Pioneer APS-SE20G-2T_SJ08C21153WL - 2 TB (nvme0n1). This is driving me nuts. I'll probably replace the drive; thank god it's still under warranty. Which NVMe drives are you guys using?
voltbit Posted November 28, 2020

I have the exact same problem with a Kingston A2000, 500 GB, NVMe, M.2. For my use case I create 10-12 containers in quick succession, and the system is guaranteed to crash. If anyone can recommend a drive that definitely does not have this issue, I would be grateful. I will return my drive under warranty and probably just use the HDD until I find a solution.
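For anyone who wants to reproduce that kind of burst on their own hardware, something along these lines approximates it. The image and container count are placeholders (busybox is just an arbitrary small image), so swap in whatever matches your setup:

# spin up a dozen throwaway containers back to back
for i in $(seq 1 12); do
  docker run -d --name nvme-stress-$i busybox sleep 600
done

# tear them down afterwards
docker rm -f $(docker ps -aq --filter name=nvme-stress-)

On my system, that burst of container creation is what reliably triggers the crash.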
onufry Posted January 18, 2021

Hey, same problem here. Initially I had Sabrent 1TB Rocket drives in a pool and the issue happened several times, so I replaced the Sabrents with Samsung 980 PRO 1TB drives. Everything worked fine for about two weeks, and today the same issue hit again: I "lost" one of the drives. I do have the nvme_core.default_ps_max_latency_us=0 line, and I see the cache showing "Not Installed" after reboots. Any suggestions?
xabi Posted January 20, 2021 (edited)

On 11/28/2020 at 8:34 PM, voltbit said: "I have the exact same problem with a Kingston A2000, 500 GB, NVMe, M.2 ..."

Same disk, same issue. Did you find a solution? Thanks in advance.
-jim Posted February 20, 2021

Apparently I am having the same issue with a Crucial P5 2TB PCIe M.2 2280SS SSD. I saw notifications about the device like:

20-02-2021 15:26 Unraid device nvme0n1 message
Notice [TOWER] - device nvme0n1 returned to normal temperature
CT2000P5SSD8_20362A61D26D (nvme0n1)
normal

and

Feb 20 15:25:30 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, aborting
Feb 20 15:25:47 Tower kernel: nvme nvme0: I/O 1 QID 0 timeout, reset controller
Feb 20 15:26:00 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, reset controller
Feb 20 15:26:54 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:26:54 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 0
Feb 20 15:26:54 Tower kernel: nvme nvme0: Abort status: 0x7
Feb 20 15:27:00 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:27:00 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Feb 20 15:27:05 Tower emhttpd: error: ckmbr, 2030: Input/output error (5): read: /dev/nvme0n1
Feb 20 15:27:05 Tower emhttpd: import 30 cache device: (nvme0n1) CT2000P5SSD8_20362A61D26D
Feb 20 15:27:05 Tower emhttpd: import flash device: sdb
Feb 20 15:27:05 Tower kernel: nvme nvme0: Device not ready; aborting reset
Feb 20 15:27:05 Tower kernel: Buffer I/O error on dev nvme0n1, logical block 0, async page read
Feb 20 15:27:05 Tower kernel: nvme nvme0: failed to set APST feature (-19)

So which drives will work? Thanks, -jim
cyberspectre (Author) Posted February 24, 2021

Months after replacing the Crucial drive with a Samsung 970 Evo, I'm pleased to say this issue has not happened since. Not even once. The Samsung doesn't miss a beat.