[SOLVED] Sudden problems when starting VMs



I've suddenly started having problems whenever I try to launch a VM. My server was rock solid for well over a year until now.

 

When I launch a VM, shortly after the guest OS boots, the web UI shows 6 or 7 CPU threads (random ones, not necessarily the ones pinned to the VM) spiking and staying at redline. Seconds later, the web UI becomes unresponsive, the entire system locks up, and you've got no choice but to power-cycle.

 

I've attached my diagnostic file and a couple of persistent log files that may be helpful. I have noticed these worrying messages:

 

May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 10 QID 7 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 192 QID 10 timeout, aborting
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 193 QID 10 timeout, aborting
May 26 16:05:11 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, reset controller
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Device not ready; aborting reset
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 558795512
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 588290920
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 586574048
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 573818160
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7
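
For anyone comparing several syslogs, the kernel messages above can be tallied per device to see how often the controller timed out or failed to reset. This is only a rough sketch: the regular expressions are written against the exact message wording quoted in this post (other kernel versions may phrase things differently), and the event names are made up for illustration.

```python
import re
from collections import Counter

# Patterns written against the kernel messages quoted in this post.
# The event names on the left are illustrative labels, not kernel terms.
PATTERNS = {
    "timeout_abort": re.compile(r"nvme (nvme\d+): I/O \d+ QID \d+ timeout, aborting"),
    "controller_reset": re.compile(r"nvme (nvme\d+): I/O \d+ QID \d+ timeout, reset controller"),
    "reset_failed": re.compile(r"nvme (nvme\d+): Device not ready; aborting reset"),
    "io_error": re.compile(r"I/O error, dev (nvme\d+)n\d+, sector \d+"),
}


def tally_nvme_events(lines):
    """Count NVMe-related events per (device, event) pair."""
    counts = Counter()
    for line in lines:
        for event, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), event)] += 1
                break  # one event per log line
    return counts


# A few of the lines quoted above, used as sample input.
SAMPLE = """\
May 26 16:04:41 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, aborting
May 26 16:05:11 ANDRAS4 kernel: nvme nvme0: I/O 9 QID 7 timeout, reset controller
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Device not ready; aborting reset
May 26 16:07:15 ANDRAS4 kernel: print_req_error: I/O error, dev nvme0n1, sector 558795512
May 26 16:07:15 ANDRAS4 kernel: nvme nvme0: Abort status: 0x7""".splitlines()

for (device, event), count in sorted(tally_nvme_events(SAMPLE).items()):
    print(device, event, count)
```

To run it against a real log, pass an open file object instead of SAMPLE, since the function only needs an iterable of lines.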

Do these messages indicate my cache SSD is failing? The VM domains are on the cache, set to "prefer."

 

Thanks for the assistance!

andras4-diagnostics-20200526-2048.zip syslog-1589603478 syslog-1590534581 syslog-1590535014 syslog-1590541990 syslog-1590542150 syslog-1590546809

Link to comment

This can sometimes help with those NVMe errors:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page, click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append":

 

nvme_core.default_ps_max_latency_us=0

 

Reboot and see if it makes a difference.
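
After the reboot, it may be worth confirming that the option was actually picked up. The sketch below reads the kernel command line and the value the nvme_core module reports. It assumes a Linux system where /proc/cmdline and /sys/module/nvme_core/parameters/ are readable; the helper names are invented for this example.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def cmdline_value(cmdline, name=PARAM):
    """Return the value given for `name` on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def read_text(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # What the boot loader passed to the kernel.
    cmdline = read_text("/proc/cmdline") or ""
    print("boot option:", cmdline_value(cmdline))
    # What the nvme_core module is using; 0 means APST power saving is off.
    print("module value:",
          read_text("/sys/module/nvme_core/parameters/default_ps_max_latency_us"))
```

If the first line prints None, the "append" line was probably edited under a boot entry other than the one that is actually being used.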

Link to comment
Thanks, I've done so. Seems to be working normally now. I'll report back if I have more issues.

Link to comment
  • 2 weeks later...
On 5/28/2020 at 1:14 AM, johnnie.black said:

Look for a BIOS update or try a different NVMe device.

Thanks johnnie.black. Updating the firmware seemed promising at first, but ultimately made no difference. Since it's clear to me now that this is an issue with the disk itself, I'm going to make a new thread in hardware to get some more opinions.

Link to comment
  • 2 weeks later...

For posterity, I'd like to report that I solved the problem.

 

Using smartctl / nvme, I discovered that even though the disk's main temperature was within the acceptable range, one of the disk's secondary temperature sensors (labeled Temperature Sensor 5) was reading 60-64°C at idle and 70°C or higher under load. This is most likely the temperature of the controller. Apparently, most NVMe controllers begin throttling at 70°C, so the I/O errors make sense.
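
For anyone who wants to check the secondary sensors on their own drive, here is a minimal sketch that pulls the "Temperature Sensor N" lines out of smartctl's text report and flags any at or above a limit. The device path /dev/nvme0, the sample readings, and the function names are assumptions for illustration, and the 70°C default is simply the figure mentioned above, not a value from any drive's datasheet. Not every drive reports secondary sensors.

```python
import re
import subprocess

# Matches lines such as "Temperature Sensor 5:    64 Celsius".
SENSOR_RE = re.compile(r"^Temperature Sensor (\d+):\s+(\d+) Celsius", re.MULTILINE)


def parse_sensor_temps(report):
    """Map sensor number -> temperature in Celsius from a smartctl text report."""
    return {int(num): int(temp) for num, temp in SENSOR_RE.findall(report)}


def sensors_at_or_above(temps, limit_c=70):
    """Return the sensor numbers reading at or above limit_c."""
    return sorted(num for num, temp in temps.items() if temp >= limit_c)


def read_report(device="/dev/nvme0"):
    """Run smartctl (usually needs root); the device path is an assumption."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return result.stdout


# Made-up readings in the same layout as smartctl's NVMe health output.
SAMPLE_REPORT = """\
Temperature:                        45 Celsius
Temperature Sensor 1:               48 Celsius
Temperature Sensor 5:               71 Celsius
"""

temps = parse_sensor_temps(SAMPLE_REPORT)
print("sensors:", temps)
print("at or above limit:", sensors_at_or_above(temps))
```

Swapping SAMPLE_REPORT for read_report() checks a live drive. Running it under load rather than at idle is more telling, since the controller sensor only got hot here during heavy I/O.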

 

I moved the disk from the M.2 slot to a PCIe adapter and installed a metal heatsink. The controller now runs 20 degrees cooler on average, and I haven't experienced any more issues.

 

When it's right behind a graphics card, an M.2 slot is apparently not an ideal thermal environment for an SSD.

Link to comment
  • JorgeB changed the title to [SOLVED] Sudden problems when starting VMs
  • 3 months later...

Unfortunately, the issue has resurfaced. It is certainly connected to the temperature of the controller. Moving the SSD away from the graphics cards and installing a heat sink did help, but ultimately did not eliminate the problem. Certain activities can cause the temperature of the controller to spike suddenly, and when it does, it's lights out.

 

To anyone who reads this in the future, do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat.

Link to comment
  • 2 weeks later...
On 10/2/2020 at 1:13 AM, cyberspectre said:

Unfortunately, the issue has resurfaced. It is certainly connected to the temperature of the controller. Moving the SSD away from the graphics cards and installing a heat sink did help, but ultimately did not eliminate the problem. Certain activities can cause the temperature of the controller to spike suddenly, and when it does, it's lights out.

 

To anyone who reads this in the future, do not buy a Crucial P1 NVMe SSD for Unraid, or for a gaming PC for that matter. It cannot tolerate even moderate heat.

I honestly wish I had found this about 3 days ago, when I moved 900GB worth of content to my P5 drive. I will be replacing it with a 970 Plus.

 

Thank you for posting this.

Link to comment
  • 1 month later...

I have the exact same problem with a Kingston A2000, 500GB, NVMe, M.2.

For my use case, I create 10-12 containers in quick succession, and the system is guaranteed to crash. If anyone can recommend a drive that definitely does not have this issue, I would be grateful. I will return my drive under warranty and probably just use the HDD until I find a solution.

Link to comment
  • 1 month later...

Hey,

Same problem. Initially I had Sabrent 1TB Rocket drives in a pool and the issue happened several times, so I replaced the Sabrents with Samsung 980 PRO 1TB drives. Everything worked fine for about 2 weeks, and today the same issue happened again: I "lost" one of the drives.

I do have the nvme_core.default_ps_max_latency_us=0 line. After reboots, I see Cache "Not Installed".

Any suggestions?

Link to comment

 

On 11/28/2020 at 8:34 PM, voltbit said:

I have the exact same problem with a Kingston A2000, 500GB, NVMe, M.2.

For my use case, I create 10-12 containers in quick succession, and the system is guaranteed to crash. If anyone can recommend a drive that definitely does not have this issue, I would be grateful. I will return my drive under warranty and probably just use the HDD until I find a solution.

 

Same disk, same issue. Did you find a solution?

 

Thanks in advance.

Edited by xabi
Link to comment
  • 1 month later...

Apparently I am having the same issue with a Crucial P5 2TB PCIe M.2 2280SS SSD.

 

I saw a notification about the device like this:

20-02-2021 15:26 Unraid device nvme0n1 message Notice [TOWER] - device nvme0n1 returned to normal temperature CT2000P5SSD8_20362A61D26D (nvme0n1) normal

and

Feb 20 15:25:30 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, aborting 
Feb 20 15:25:47 Tower kernel: nvme nvme0: I/O 1 QID 0 timeout, reset controller 
Feb 20 15:26:00 Tower kernel: nvme nvme0: I/O 41 QID 10 timeout, reset controller 
Feb 20 15:26:54 Tower kernel: nvme nvme0: Device not ready; aborting reset 
Feb 20 15:26:54 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 0 
Feb 20 15:26:54 Tower kernel: nvme nvme0: Abort status: 0x7 
Feb 20 15:27:00 Tower kernel: nvme nvme0: Device not ready; aborting reset 
Feb 20 15:27:00 Tower kernel: nvme nvme0: Removing after probe failure status: -19 
Feb 20 15:27:05 Tower emhttpd: error: ckmbr, 2030: Input/output error (5): read: /dev/nvme0n1 
Feb 20 15:27:05 Tower emhttpd: import 30 cache device: (nvme0n1) CT2000P5SSD8_20362A61D26D 
Feb 20 15:27:05 Tower emhttpd: import flash device: sdb 
Feb 20 15:27:05 Tower kernel: nvme nvme0: Device not ready; aborting reset 
Feb 20 15:27:05 Tower kernel: Buffer I/O error on dev nvme0n1, logical block 0, async page read 
Feb 20 15:27:05 Tower kernel: nvme nvme0: failed to set APST feature (-19)
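
The "-19" in the "probe failure status" and "failed to set APST feature" lines above looks like a negated Linux errno value. Assuming that reading is right, a small sketch like this decodes it. The regular expression only targets the two phrasings seen in this log, and decode_status is a name made up for the example.

```python
import errno
import os
import re

# Matches "status: -19" or a trailing "(-19)", as in the lines above.
STATUS_RE = re.compile(r"status: (-\d+)|\((-\d+)\)")


def decode_status(line):
    """Return (errno name, description) for a negative status in a log line."""
    match = STATUS_RE.search(line)
    if not match:
        return None
    code = -int(match.group(1) or match.group(2))
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


LINES = [
    "Feb 20 15:27:00 Tower kernel: nvme nvme0: Removing after probe failure status: -19",
    "Feb 20 15:27:05 Tower kernel: nvme nvme0: failed to set APST feature (-19)",
]

for line in LINES:
    print(decode_status(line))
```

Errno 19 is ENODEV, which would be consistent with the controller having dropped off the bus by the time the kernel tried to reset it. The "Abort status: 0x7" lines are a different kind of code and are deliberately not decoded here.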

 

So which drives will work?

Thanks

-jim

Link to comment
